Electronic Feedback in Large University Statistics Courses: The Longitudinal Effects of Quizzes on Motivation, Emotion, and Cognition 365841619X, 9783658416195

Digital tools and pedagogies in public higher education are unfolding their potential by providing large groups of stude

English Pages 436 [424] Year 2023

Table of contents :
Foreword
Acknowledgements
Contents
Abbreviations
List of Figures
List of Tables
1 Potentials of and Barriers to Teaching Statistics in Large Higher Education Lectures
1.1 The Incompatibility between the Relevance of Statistics for Personality Formation and Traditional Approaches to Teaching
1.2 Feedback and Flipped Teaching as Competence- and Autonomy-oriented Countermeasures to Improve Teaching Statistics
1.3 Prevalent Gaps in the Research on Improving the Teaching of Statistics by Means of Instructional Adaptations
1.4 Derived Research Objectives and Structure of the Thesis
2 Knowledge Acquisition and Transmission in the Domain of Statistics
2.1 Particularization of Statistical Reasoning to the Study Context in Depth and Breadth
2.1.1 Hierarchical Models of Statistics Reasoning and its Transition from Computational to Conceptual Understanding
2.1.2 Content Models of Statistics Reasoning and their Measurement
2.1.3 Comparison between a Preliminary Statistics Core Curriculum and the Content Covered in the Investigated Statistics Course
2.1.4 Implications from the Hierarchical and Content Models for the Measurement of Statistics Reasoning in the Present Study
2.2 Impediments to the Furtherance of Statistical Reasoning in the Context of Teaching and Potential Expedients
3 A Model for Reciprocal Interrelations between Feedback and Statistics Reasoning
3.1 Formative Feedback and Academic Achievement
3.1.1 The Significance of Formative Feedback for Academic Achievement in Theoretical and Empirical Research
3.1.2 Design Characteristics Moderating the Feedback-Achievement Relationship
3.2 Formative Feedback and Achievement Motivation
3.2.1 Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Motivation in the Uptake of Feedback
3.2.2 Feedback Models Incorporating Notions of Achievement Motivation
3.2.3 The Feedback-Related Achievement Motivation Processing Model
3.2.4 Approaching the Theoretical and Empirical Research on the Interrelations between Construct-Specific Expectancy-Value Appraisals and Feedback
3.2.5 Reciprocal Relations between Statistics Self-Efficacy and Achievement
3.2.6 Reciprocal Relations between Statistics Difficulty and Achievement
3.2.7 Reciprocal Relations between Statistics Interest/Utility Value and Achievement
3.2.8 Reciprocal Relations between Statistics Affect and Achievement
3.2.9 Reciprocal Relations between Statistics Effort and Achievement
3.3 Formative Feedback and Achievement Emotions
3.3.1 Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Emotions in the Uptake of Feedback
3.3.2 Motivational and Emotional Uptake of Feedback According to the Control-Value Theory of Achievement Emotions
3.3.3 Reciprocal Relations between Enjoyment, Hopelessness, and Achievement
3.3.4 Differential Perceptions of Achievement Emotions in In-class and Out-of-class Learning Contexts
3.4 Multiplicative Effects of Expectancy-Value Appraisals on Achievement Emotions and Performance
3.5 Average Development of Achievement Motivation and Emotion throughout a Semester
3.6 Intertwining the Expectancy- and Control-Value Theory to an Integrative Model of Motivational and Emotional Feedback Processing
4 Further Contextualization of Motivational and Emotional Feedback Processing
4.1 Variables and Contexts Considered Relevant for Feedback Processing in Statistics Education
4.2 The Prevalent Gender Differential in Statistics-related Motivational and Emotional Appraisals
4.2.1 Gender-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research
4.2.2 Gender-related Differences in Feedback Processing in Theoretical and Empirical Research
4.3 The Shaping Role of Statistics-related Prior Knowledge in Motivational and Emotional Appraisals
4.3.1 Expertise-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research
4.3.2 Expertise-related Differences in Feedback Processing in Theoretical and Empirical Research
4.4 The Flipped Classroom as a Potential Catalyst of the Motivational and Emotional uptake of Feedback
4.4.1 Defining Characteristics of a Flipped Classroom
4.4.2 Theoretical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms
4.4.3 Empirical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms
4.5 Broadening the Feedback Model to Account for Individual and Contextual Differences in the Uptake of Feedback
5 Empirical Basis
5.1 Analytical Method of Autoregressive Structural Equation Modeling
5.2 Underlying Circumstances of the Data Collection
5.2.1 The Traditional and Flipped Course Frameworks
5.2.2 Measurement Instruments
5.2.3 Longitudinal Assessment Framework and Assessment Methods
5.3 Quality Criteria of the Study Design and Measurement Instruments
5.3.1 Objectivity Evidence
5.3.2 Reliability and Validity Evidence Based on the Internal Structure
5.3.3 Validity Evidence Based on Test Content
5.3.4 Validity Evidence Based on Relations with other Variables
5.3.5 Further Relevant Validity Criteria
5.3.6 Implications From the Reliability and Validity Evidence for the further Analyses
5.4 Samples
5.4.1 Participants
5.4.2 Missing Values and Panel Mortality
5.4.3 Distribution of the Data
6 Evaluation of the Unmodified Measurement Models
6.1 Choice of an Appropriate Estimator
6.2 Specification of the Factor-indicator Effect Direction
6.3 Implications of the Construct Specification for the Model Evaluation
6.4 Global Goodness-of-fit of the Unmodified Measurement Models
6.5 Item-specific Analyses for the Unmodified Measurement Models
6.5.1 Indicator Reliabilities
6.5.2 Dimensionality of the Item Structure
6.6 Construct-level Analyses for the Original Measurement Models
6.6.1 Composite Reliabilities
6.6.2 Average Variance Extracted
6.6.3 Fornell-Larcker Criterion
7 Optimization of the Measurement Models
7.1 Expectancy-value Indicators Considered for Removal or Optimization
7.1.1 Expectancy Constructs
7.1.2 Value Constructs
7.1.3 Indicator-specific Expectancy-value Effects
7.2 Evaluation of the Modified Expectancy-value Constructs
7.2.1 AVE, Composite Reliability, Fornell-Larcker, and Factorial Structure
7.2.2 Global Goodness-of-fit for the Expectancy-value Constructs
7.2.3 Examination of the Assumed 6-Factorial Structure of the Expectancy-value Constructs
7.3 Evaluation of the Modified Achievement Emotion Constructs
7.3.1 Achievement Emotion Indicators Considered for Removal or Optimization
7.3.2 AVE, Composite Reliability, Fornell-Larcker, and Goodness-of-fit
7.4 Residual Correlations
7.5 Final Evaluation and Reconceptualization of the Modified Measurement Models
8 Results of the Longitudinal Study
8.1 Testing Measurement Invariance Across Time
8.2 Average Motivational and Emotional Trajectories throughout the Semester
8.3 Separate Reciprocal Causation Models
8.3.1 Modeling Approach and Goodness-of-Fits
8.3.2 Expectancy-Feedback Model
8.3.3 Value-Feedback Model
8.3.4 Emotion-Feedback Models
8.4 Empirical Modeling of the Integrative Model of Achievement Motivation and Emotion
8.4.1 The Need for Downsizing the Expectancy-Value Causation Model
8.4.2 Integrating Achievement Motivation and Emotion into Additive Control-Value Causation Models
8.4.3 Squared Multiple Correlation of the Integrative Models
8.4.4 The Examination of Multiplicative Effects within the Integrative Control-Value Causation Models
8.4.5 Visualization of Conditional Expectancy-Value Effects by Means of Simple Slopes
9 Multiple Group Causal Analyses
9.1 Operationalization of the Grouping Variables
9.2 Testing Measurement Invariance Across Groups
9.3 Group-specific Average Development Throughout the Semester
9.4 Group-specific Reciprocal Causation Models
9.4.1 Goodness-of-fit of Each Group-Specific Model
9.4.2 Design-specific Reciprocal Causative Models
9.4.3 Gender-specific Reciprocal Causative Models
9.4.4 Expertise-specific Reciprocal Causative Models
9.4.5 Implications and Follow-up Questions from the Multiple Group Analyses
9.5 Secondary Findings on Group-specific Moderation Effects
9.5.1 Group-specific Multiplicative Expectancy-value Effects
9.5.2 Design-specific Quiz Effect Depending on Gender and Prior Knowledge
9.5.3 The Moderating Effect of Achievement Emotions on the Expectancy-value Feedback Effects
10 Discussion and Conclusion
10.1 Synoptic Evaluation of the Hypotheses
10.1.1 Do Formative Achievement as well as Achievement Motivation and Emotions Predict each other throughout the Semester? (RQ1)
10.1.2 Do Achievement Motivation and Emotions Relate to Summative Achievement?
10.1.3 Do Feedback-Related Processes Vary according to Gender, Proficiency, and Course Design? (RQ2)
10.1.4 Do Expectancy-Value Appraisals Synergistically Predict Formative Achievement and Achievement Emotions?
10.1.5 Matthew Effects and Decreasing Salience of Feedback Effects
10.2 Practical Implications and Future Directions
10.2.1 The Necessity for Scaling up Formative Feedback in Higher Education
10.2.2 Methodological Considerations and Limitations of the Present Study
Bibliography
Andreas Maur

Electronic Feedback in Large University Statistics Courses: The Longitudinal Effects of Quizzes on Motivation, Emotion, and Cognition

Andreas Maur
Johannes Gutenberg University Mainz
Mainz, Germany

Dissertation presented to the Faculty of Law, Management and Economics at the Johannes Gutenberg University of Mainz (D77) by Andreas Maur in Mainz in 2022 in fulfillment of the requirements for the degree of Doctor of Economic Political Sciences (Dr. rer. pol.).

First assessor: Prof. Dr. Manuel Förster
Second assessor: Prof. Dr. Olga Zlatkin-Troitschanskaia
Day of the oral examination: November 3, 2022

ISBN 978-3-658-41619-5
ISBN 978-3-658-41620-1 (eBook)
https://doi.org/10.1007/978-3-658-41620-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer VS imprint is published by the registered company Springer Fachmedien Wiesbaden GmbH, part of Springer Nature. The registered company address is: Abraham-Lincoln-Str. 46, 65189 Wiesbaden, Germany

Dedicated to my beloved mother.

Foreword

In 2015, Mr Maur had approached me after my seminar and expressed his interest in pursuing an academic career at the Chair of Business and Economics Education at the Johannes Gutenberg University Mainz. At the time, I was preparing an application for an independent third-party funding as part of my junior professorship. Well aware of the uncertainties involved in the arduous process of applying for such funding, Mr Maur agreed to commit himself to this endeavor. Our efforts eventually led to the approval of the FLIPPS project, funded by the Federal Ministry of Education and Research (2017–2020). A few years later, Mr Maur’s excellent work within FLIPPS came to fruition with his dissertation, based on data collected within the project, which was assessed summa cum laude in autumn 2022. Mr Maur’s research addresses several research gaps at once. Firstly, student reasoning in methodological statistics courses has been shown to be highly deficient. This is compounded by the fact that students attend these courses with anxiety and low motivation, for which large lectures in European public universities offer only a very limited set of potentially effective countermeasures. Assuming that the integration of blended learning and quiz feedback provides a feasible basis for fostering individual learning processes under such conditions, there still is limited research on longitudinal, reciprocal feedback effects based on comprehensive, latent competency frameworks. In higher education, the shortcomings in longitudinal research are largely due to the methodological challenges of measurement, and in particular the constant trade-off between ecologically valid measurements in the respective teaching environments and natural panel mortality (i.e., due to dropouts). However, for the analysis of learning processes, more ambitious latent modelling efforts in repeated surveys of larger student populations are necessary to relate potential changes in cognitive and
attitudinal trajectories to concrete instructional approaches, which will then allow for evaluations of their effectiveness. With his dissertation, Andreas Maur is exploring this under-researched and highly challenging terrain. Based on Weinert’s multidimensional conceptualization of competences (2001), it is investigated how students’ motivational, emotional, and cognitive learning processes can be fostered throughout a large university course. As a basis for the formulation of the research hypotheses, the author collates and scrutinizes the current state of international research, starting from rather behavioristic to increasingly more holistic learning theories, leading to the “control-value theory of achievement emotions” (Pekrun, 2006). Based on the desiderata of this synthesis, Mr Maur derives the longitudinal assessment framework for the empirical study. After a critical review of existing measurement methodological challenges, a well-founded structural model is specified and thoroughly optimized with regard to its suitability for testing the hypotheses. Subsequently, a sophisticated latent approach is used to investigate the reciprocal relationship between quiz feedback and motivational as well as emotional development over the course of a semester in generic structural models and in multiple groups of relevant conditioning factors (i.e., gender, prior statistics-related knowledge, and course design). Using a well-developed latent structural equation model and under consideration of the highly complex interplay of motivational, emotional, and cognitive factors, Mr Maur’s dissertation has significantly advanced the analysis of longitudinal data for the valid and reliable assessment of learning outcomes in response to electronic feedback in large lectures. The evidence-based recommendations for scaling up electronic feedback and optimizing existing practice are not only of high relevance within the higher education research community. 
They also provide greater certainty for stakeholders in higher education practice and education policy in the aftermath of a global pandemic that has fueled the blending of on- and off-campus learning and the need to use digital teaching methods and tools. In contrast to previous studies, which only compared course designs without consideration of specific instructional elements, Mr Maur’s research suggests that feedback effects due to quizzes can be effectively enhanced in the autonomy-supportive environment of a flipped classroom. Hence, his efforts were urgently needed to find effective support for upcoming generations of students with an increased need for flexibility and a constructive feedback culture in the digital era of higher education. The conscientious theoretical grounding and model development, with due regard to various reliability and validity criteria, together with the robust SEM approach, enable the transfer of the analytical approach and the empirical findings to other subject areas and analytical complexes in higher education research as well.

Munich, Germany

Manuel Förster

Acknowledgements

Writing a dissertation is like a very rewarding rollercoaster ride – with just as many ups and downs, but over a much longer period of time. Highs may be when the data reflect parts of your theory, when your data provide new insights, or when Mplus does not generate an error message in the output file. Lows can be periods of tedious data preparation, desperately searching internet forums for clues to your very specific problems, or simply feeling drained and guilty. Such a glorious journey could never be completed without the unwavering support of family, friends, and colleagues. I would therefore like to express my heartfelt thanks to all those who have sat beside me on the rollercoaster of my PhD journey. First of all, I would like to express my gratitude to my first doctoral supervisor, Prof. Dr. Manuel Förster, who accompanied me through the entire doctoral process, from preparation to submission. I am very grateful for the trust he placed in me since I was a master’s student who had not yet had the opportunity to demonstrate any significant academic skills. My work was embedded in the FLIPPS project (“Fostering students’ statistical reasoning in a flipped classroom design”). By working closely with Prof. Dr. Manuel Förster on this project from scratch, I really learned a lot of important skills and techniques for my postdoctoral career. I also appreciated his valuable input at all stages of my dissertation. He always managed to find the right balance between providing motivating challenges and uncomplicated solutions when I got stuck in the data analysis. In the context of the FLIPPS project, I would also like to thank our project partners, Prof. Dr. Thorsten Schank, Dr. Constantin Weiser, as well as our former student assistants, Felix Lang and Steffen Ecker, for their indispensable support during all steps of data acquisition, collection and analysis.
I would also like to thank Prof. Dr. Olga Zlatkin-Troitschanskaia, my second doctoral supervisor, for providing me with another safe haven before, during and after my dissertation. Despite her own commitments, she is always willing to listen to the concerns of her staff. In this regard, I would also like to thank all the members of her department, my colleagues, for welcoming me into the team and for keeping me motivated during the final stages of my dissertation. In addition to the professional support, I would like to thank my family and friends for always believing in me and being there when I needed them in stressful times. First and foremost, my deepest gratitude goes to my parents, Rita and Günter Maur, who gave me everything I needed to achieve my goals. They took care of me in good and bad times and always supported me without compromise. The same goes for my other family, Alexander and Anke Maur, who have always given me very helpful, motivating, and supportive advice on important decisions and life choices. I would also like to thank Nanda for her relentless trust in me, for motivating and inspiring me, as well as for taking my mind off things in the right situations. This level of sympathy and support from my family and friends is not to be taken for granted and I am very grateful to have such understanding and helpful companions by my side.

Contents

1

2

Potentials of and Barriers to Teaching Statistics in Large Higher Education Lectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 The Incompatibility between the Relevance of Statistics for Personality Formation and Traditional Approaches to Teaching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Feedback and Flipped Teaching as Competenceand Autonomy-oriented Countermeasures to Improve Teaching Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Prevalent Gaps in the Research on Improving the Teaching of Statistics by Means of Instructional Adaptations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Derived Research Objectives and Structure of the Thesis . . . . Knowledge Acquisition and Transmission in the Domain of Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Particularization of Statistical Reasoning to the Study Context in Depth and Breadth . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Hierarchical Models of Statistics Reasoning and its Transition from Computational to Conceptual Understanding . . . . . . . . . . . . . . . . . . . . 2.1.2 Content Models of Statistics Reasoning and their Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.3 Comparison between a Preliminary Statistics Core Curriculum and the Content Covered in the Investigated Statistics Course . . . . . . . . . . . . . . .

1

1

4

5 8 13 13

13 21

24

xiii

xiv

Contents

2.1.4

2.2 3

Implications from the Hierarchical and Content Models for the Measurement of Statistics Reasoning in the Present Study . . . . . . . . . . . . . . . . . . Impediments to the Furtherance of Statistical Reasoning in the Context of Teaching and Potential Expedients . . . . . . . .

A Model for Reciprocal Interrelations between Feedback and Statistics Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Formative Feedback and Academic Achievement . . . . . . . . . . . 3.1.1 The Significance of Formative Feedback for Academic Achievement in Theoretical and Empirical Research . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Design Characteristics Moderating the Feedback-Achievement Relationship . . . . . . . . . . . 3.2 Formative Feedback and Achievement Motivation . . . . . . . . . . 3.2.1 Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Motivation in the Uptake of Feedback . . . . . . . . . . . . . 3.2.2 Feedback Models Incorporating Notions of Achievement Motivation . . . . . . . . . . . . . . . . . . . . . . . 3.2.3 The Feedback-Related Achievement Motivation Processing Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.4 Approaching the Theoretical and Empirical Research on the Interrelations between Construct-Specific Expectancy-Value Appraisals and Feedback . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.5 Reciprocal Relations between Statistics Self-Efficacy and Achievement . . . . . . . . . . . . . . . . . . . . 3.2.6 Reciprocal Relations between Statistics Difficulty and Achievement . . . . . . . . . . . . . . . . . . . . . . . 3.2.7 Reciprocal Relations between Statistics Interest/Utility Value and Achievement . . . . . . . . . . . . . 3.2.8 Reciprocal Relations between Statistics Affect and Achievement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.9 Reciprocal Relations between Statistics Effort and Achievement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Formative Feedback and Achievement Emotions . . . . . . . . . . . .

26 27 33 33

33 36 41

41 43 51

53 54 57 61 67 72 77

Contents

xv

3.3.1

3.4 3.5 3.6

4

Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Emotions in the Uptake of Feedback . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 Motivational and Emotional Uptake of Feedback According to the Control-Value Theory of Achievement Emotions . . . . . . . . . . . . . . . . . . . . . . . . 3.3.3 Reciprocal Relations between Enjoyment, Hopelessness, and Achievement . . . . . . . . . . . . . . . . . . . 3.3.4 Differential Perceptions of Achievement Emotions in In-class and Out-of-class Learning Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Multiplicative Effects of Expectancy-Value Appraisals on Achievement Emotions and Performance . . . . . . . . . . . . . . . . Average Development of Achievement Motivation and Emotion throughout a Semester . . . . . . . . . . . . . . . . . . . . . . . Intertwining the Expectancy- and Control-Value Theory to an Integrative Model of Motivational and Emotional Feedback Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Further Contextualization of Motivational and Emotional Feedback Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Variables and Contexts Considered Relevant for Feedback Processing in Statistics Education . . . . . . . . . . . . 4.2 The Prevalent Gender Differential in Statistics-related Motivational and Emotional Appraisals . . . . . . . . . . . . . . . . . . . . 4.2.1 Gender-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Gender-related Differences in Feedback Processing in Theoretical and Empirical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 The Shaping Role of Statistics-related Prior Knowledge in Motivational and Emotional Appraisals . . . . . . . . . . . . . . . . . 4.3.1 Expertise-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research . . . . . . . . . . . . . 4.3.2 Expertise-related Differences in Feedback Processing in Theoretical and Empirical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

78 83

87 89 91

93 97 97 98

98

102 104

104

106

xvi

Contents

4.4

4.5

5

6

The Flipped Classroom as a Potential Catalyst of the Motivational and Emotional uptake of Feedback . . . . . . 4.4.1 Defining Characteristics of a Flipped Classroom . . . . 4.4.2 Theoretical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4.3 Empirical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Broadening the Feedback Model to Account for Individual and Contextual Differences in the Uptake of Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Empirical Basis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Analytical Method of Autoregressive Structural Equation Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Underlying Circumstances of the Data Collection . . . . . . . . . . . 5.2.1 The Traditional and Flipped Course Frameworks . . . . 5.2.2 Measurement Instruments . . . . . . . . . . . . . . . . . . . . . . . . 5.2.3 Longitudinal Assessment Framework and Assessment Methods . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Quality Criteria of the Study Design and Measurement Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Objectivity Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.2 Reliability and Validity Evidence Based on the Internal Structure . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.3 Validity Evidence Based on Test Content . . . . . . . . . . . 5.3.4 Validity Evidence Based on Relations with other Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.5 Further Relevant Validity Criteria . . . . . . . . . . . . . . . . . 5.3.6 Implications From the Reliability and Validity Evidence for the further Analyses . . . . . . . . . . . . . . . . . 5.4 Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.2 Missing Values and Panel Mortality . . . . . . . . . . . . . . . 5.4.3 Distribution of the Data . . . . . . . . . . . . . . . . . . . . . . . . . . Evaluation of the Unmodified Measurement Models . . . . . . . . . . . . . 6.1 Choice of an Appropriate Estimator . . . . . . . . . . . . . . . . . . . . . . . 6.2 Specification of the Factor-indicator Effect Direction . . . . . . . .

110 110

111

118

123 127 127 128 128 131 138 140 140 142 145 151 154 156 157 157 161 168 171 171 174

Contents

6.3 6.4 6.5

6.6

7

8

xvii

Implications of the Construct Specification for the Model Evaluation
Global Goodness-of-fit of the Unmodified Measurement Models
6.5 Item-specific Analyses for the Unmodified Measurement Models
6.5.1 Indicator Reliabilities
6.5.2 Dimensionality of the Item Structure
6.6 Construct-level Analyses for the Original Measurement Models
6.6.1 Composite Reliabilities
6.6.2 Average Variance Extracted
6.6.3 Fornell-Larcker Criterion
7 Optimization of the Measurement Models
7.1 Expectancy-value Indicators Considered for Removal or Optimization
7.1.1 Expectancy Constructs
7.1.2 Value Constructs
7.1.3 Indicator-specific Expectancy-value Effects
7.2 Evaluation of the Modified Expectancy-value Constructs
7.2.1 AVE, Composite Reliability, Fornell-Larcker, and Factorial Structure
7.2.2 Global Goodness-of-fit for the Expectancy-value Constructs
7.2.3 Examination of the Assumed 6-Factorial Structure of the Expectancy-value Constructs
7.3 Evaluation of the Modified Achievement Emotion Constructs
7.3.1 Achievement Emotion Indicators Considered for Removal or Optimization
7.3.2 AVE, Composite Reliability, Fornell-Larcker, and Goodness-of-fit
7.4 Residual Correlations
7.5 Final Evaluation and Reconceptualization of the Modified Measurement Models
8 Results of the Longitudinal Study
8.1 Testing Measurement Invariance Across Time

8.2 Average Motivational and Emotional Trajectories throughout the Semester
8.3 Separate Reciprocal Causation Models
8.3.1 Modeling Approach and Goodness-of-Fits
8.3.2 Expectancy-Feedback Model
8.3.3 Value-Feedback Model
8.3.4 Emotion-Feedback Models
8.4 Empirical Modeling of the Integrative Model of Achievement Motivation and Emotion
8.4.1 The Need for Downsizing the Expectancy-Value Causation Model
8.4.2 Integrating Achievement Motivation and Emotion into Additive Control-Value Causation Models
8.4.3 Squared Multiple Correlation of the Integrative Models
8.4.4 The Examination of Multiplicative Effects within the Integrative Control-Value Causation Models
8.4.5 Visualization of Conditional Expectancy-Value Effects by Means of Simple Slopes
9 Multiple Group Causal Analyses
9.1 Operationalization of the Grouping Variables
9.2 Testing Measurement Invariance Across Groups
9.3 Group-specific Average Development Throughout the Semester
9.4 Group-specific Reciprocal Causation Models
9.4.1 Goodness-of-fit of Each Group-Specific Model
9.4.2 Design-specific Reciprocal Causative Models
9.4.3 Gender-specific Reciprocal Causative Models
9.4.4 Expertise-specific Reciprocal Causative Models
9.4.5 Implications and Follow-up Questions from the Multiple Group Analyses
9.5 Secondary Findings on Group-specific Moderation Effects
9.5.1 Group-specific Multiplicative Expectancy-value Effects

9.5.2 Design-specific Quiz Effect Depending on Gender and Prior Knowledge
9.5.3 The Moderating Effect of Achievement Emotions on the Expectancy-value Feedback Effects
10 Discussion and Conclusion
10.1 Synoptic Evaluation of the Hypotheses
10.1.1 Do Formative Achievement as well as Achievement Motivation and Emotions Predict each other throughout the Semester? (RQ1)
10.1.2 Do Achievement Motivation and Emotions Relate to Summative Achievement?
10.1.3 Do Feedback-Related Processes Vary according to Gender, Proficiency, and Course Design? (RQ2)
10.1.4 Do Expectancy-Value Appraisals Synergistically Predict Formative Achievement and Achievement Emotions?
10.1.5 Matthew Effects and Decreasing Salience of Feedback Effects
10.2 Practical Implications and Future Directions
10.2.1 The Necessity for Scaling up Formative Feedback in Higher Education
10.2.2 Methodological Considerations and Limitations of the Present Study
Bibliography

Abbreviations

2f2             Face-to-face
AEQ             Achievement Emotion Questionnaire
AME appraisals  Achievement motivation and emotion appraisals
C.I.            Confidence interval
CV              Control-value
e.g.,           Exempli gratia/for example
EV              Expectancy-value
FIML            Full information maximum likelihood
H               Hypothesis
i.e.,           Id est/that is
M.I.            Modification index
SATS            Survey of Attitudes Towards Statistics
SATS-M          Survey of Attitudes Towards Statistics Model
SDT             Self-determination theory
SEM             Structural equation modeling
TC/FC           Traditional classroom/flipped classroom

List of Figures

Figure 3.1 Carver and Scheier’s Feedback Model (2000)
Figure 3.2 Zimmerman’s Model of Self-Regulated Learning (2013)
Figure 3.3 Theoretical Expectancy-Value Model
Figure 3.4 Feedback-Related Achievement Motivation (FRAM) Processing Model
Figure 3.5 Structure of the Theoretical Analysis for Sections 3.2.5–3.2.9 and 3.3.3
Figure 3.6 Control-Value Model of Achievement Emotions
Figure 3.7 Feedback-Related Achievement Motivation and Emotion (FRAME) Processing Model
Figure 3.8 Analytical Adaptation of the FRAME model with Corresponding Hypotheses
Figure 4.1 Contextualized FRAME Model according to Supply and Use
Figure 5.1 Learning Opportunities and their Structure within the Flipped Classroom
Figure 5.2 Exemplary Questions from an Electronic Quiz
Figure 5.3 Longitudinal Assessment Framework
Figure 5.4 Distribution of Skewness and Kurtosis in the Dataset
Figure 7.1 Modified Expectancy Measurement Models
Figure 7.2 Modified Value Measurement Models
Figure 7.3 Modified Course Emotion Measurement Models
Figure 7.4 Modified Learning Emotion Measurement Models
Figure 8.1 Modelling of Autoregressive and Cross-Lagged Paths Using the Example of the Expectancy Model
Figure 8.2 Path Diagram for the Relationships between Quiz and Expectancy Factors
Figure 8.3 Path Diagram for the Relationships between Quiz and Value Factors
Figure 8.4 Path Diagram for the Relationships between Quiz and Course Emotion Factors
Figure 8.5 Path Diagram for the Relationships between Quiz and Learning Emotion Factors
Figure 8.6 Path Diagram for the Relationships between Quiz and Expectancy-Value Constructs
Figure 8.7 Path Diagram for the Relationships between Quiz, Expectancy-Interest, and Emotion Factors
Figure 8.8 Path Diagram for the Relationships between Quiz, Expectancy-Value, and Emotion Factors
Figure 8.9 Path Diagram for the Interactive Relations between Expectancy-Interest, Quiz, and Emotions
Figure 8.10 Path Diagram for the Interactive Relations between Expectancy-Value, Quiz, and Emotions
Figure 8.11 Johnson-Neyman Diagram for the Effect of Value on Quiz 1 with Increasing Self-Efficacy
Figure 8.12 Diagram on the Effect of Value at t1 on Quiz 1 at Varying Self-Efficacy Levels
Figure 8.13 Johnson-Neyman Diagram for the Effect of Interest on Quiz with Increasing Self-Efficacy
Figure 8.14 Johnson-Neyman Diagram for the Effect of Value on Learning Enjoyment with Increasing Self-Efficacy
Figure 8.15 Diagram of the Effect of Value at t1 on Learning Enjoyment at t3 at Varying Levels of Self-Efficacy
Figure 9.1 Path Diagram for the Multiplicative Relations of Quiz and Learning Emotions on Expectancy-Value Appraisals

List of Tables

Table 2.1 Comparative Overview of Statistical Reasoning Models according to Aspiration Level
Table 2.2 Statistical Core Curriculum Based on Ten Textbooks in Comparison to the Investigated Statistics Course
Table 3.1 Properties of the Formative and Summative Assessments of the Investigated Statistics Course
Table 3.2 Exemplary Overview of a Variety of Achievement Emotions according to the 3 × 2 taxonomy
Table 5.1 Comparison of the Characteristics of the Traditional and Flipped Course in the Context of this Study
Table 5.2 Variable Definitions, Scaling, and Cronbach’s α Reliabilities
Table 5.3 Construct Definitions, Scaling, and Cronbach’s α Reliabilities
Table 5.4 Abbreviations Used to Refer the Constructs throughout their Longitudinal Assessment
Table 5.5 Comparison of the Content Assessed in the Quizzes and Exam Compared to its Coverage in the Course and an Exemplary Core Curriculum
Table 5.6 Mean Comparison and their Effect Sizes for Both Samples
Table 5.7 Sample Size and Share for Both Cohorts throughout the Semester
Table 5.8 Share of Matched Participants in Both Cohorts
Table 5.9 Correlations Between Selected Manifest Items and Missingness

Table 6.1 Decision Rules for Factor-Indicator Effect Directions
Table 6.2 Overview of Quality Criteria and Cut-Offs Underlying the Present Study
Table 6.3 Fit Indices of the Unmodified Measurement Models
Table 6.4 Indicator Reliabilities of All Original Items
Table 6.5 Model Fit of the Original Expectancy Constructs
Table 6.6 GEOMIN Rotated Factor Loadings for the Original Expectancy Constructs
Table 6.7 Model Fit of the Original Value Constructs
Table 6.8 GEOMIN Rotated Factor Loadings for the Original Value Constructs
Table 6.9 Model Fit of the Original Learning-Related Emotion Constructs
Table 6.10 Rotated factor loadings for the original learning-related emotion constructs
Table 6.11 Model Fit of the Original Class-Related Emotion Constructs
Table 6.12 Rotated Factor Loadings for the Original Class-Related Emotion Constructs
Table 6.13 Composite Reliabilities and Factor Determinacies of the Original Constructs
Table 6.14 AVE of the Original Factors
Table 6.15 Root AVE and Factor Correlations of the Original Expectancy Constructs
Table 6.16 Root AVE and Factor Correlations of the Original Value Constructs
Table 6.17 Root AVE and Factor Correlations of the Course and Learning Emotion Constructs
Table 7.1 Identifiers for the Measurement Model Evaluation
Table 7.2 Problematic Indicator Reliabilities and Cross-Loadings of the Expectancy Models
Table 7.3 Problematic Indicator Reliabilities and Cross-Loadings of the Value Models
Table 7.4 Modification Indices Greater than 10 in Ascending Order
Table 7.5 Comparison of the Original and Modified Expectancy Models regarding Composite Reliability and AVE
Table 7.6 Root AVE and Correlations of the Modified Expectancy Constructs with Method Factors

Table 7.7 GEOMIN Rotated Loadings for the Modified Expectancy Constructs with Method Factors
Table 7.8 Comparison of the Original and Modified Value Models regarding Composite Reliability and AVE
Table 7.9 Root AVE and Correlations of the Modified Value Constructs with Method Factors
Table 7.10 GEOMIN Rotated Loadings for the Modified Value Constructs with Method Factors
Table 7.11 Model Parameters for the Original and Modified EV Constructs
Table 7.12 GEOMIN Rotated Loadings for the Modified EV Constructs with Method Factors
Table 7.13 Problematic Indicator Reliabilities and Cross-Loadings of the Emotion Models
Table 7.14 Comparison of the Original and Modified Emotion Models regarding Composite Reliability and AVE
Table 7.15 Model Parameters for the Original and Modified Emotion Constructs
Table 7.16 Modification Indices for Within-Construct Residual Correlations > 20
Table 7.17 Indicator Reliabilities of the Modified Measurement Models
Table 7.18 Comparison of the Original versus Revised Construct Meanings
Table 8.1 Magnitude of Model Fit Change Indicating Measurement Non-Invariance
Table 8.2 Sensitivity Check of the Item-Level Invariance of the Selected Reference Indicators
Table 8.3 Goodness-of-Fit across Invariance Levels assuming the Unconstrained (Baseline) Model to be Correct
Table 8.4 Latent Average Development throughout the Semester
Table 8.5 Identifiers for the Separate Structural Models
Table 8.6 Structural Model Fit under Configural and Weak Invariance Conditions
Table 8.7 Structural Relationships between Quiz and Expectancy Factors
Table 8.8 Structural Relationships between Quiz and Value Factors

Table 8.9 Structural Relationships between Quiz and Course Emotions
Table 8.10 Structural Relationships between Quiz and Learning Emotions
Table 8.11 Identifiers for the Separate Structural Expectancy-Value Models
Table 8.12 Identifiers for the Structural Control-Value Models
Table 8.13 Comparative Model fit of three different control-value models
Table 8.14 Reciprocal Relationships between Quiz, Expectancy-Interest, and Emotion Factors
Table 8.15 Cross-lagged Relationships between Expectancy-Interest, and Emotion Factors
Table 8.16 Reciprocal Relationships between Quiz, Expectancy-Value, and Emotion Factors
Table 8.17 Cross-lagged Relationships between Expectancy-Interest, and Emotion Factors
Table 8.18 Squared Multiple Correlation of the Endogenous Constructs from the Integrative Control-Value Models
Table 8.19 Identifiers for the Multiplicative Control-Value Models
Table 8.20 Comparative Fit of Additive versus Multiplicative Models
Table 8.21 Effect of Value at t1 on Quiz 1 at Varying Self-Efficacy Levels
Table 8.22 Effect of Interest at t1 on quiz 2 and t5 on quiz 4 for Varying Self-Efficacy Levels
Table 8.23 Effect of Value on Learning Enjoyment at Varying Self-Efficacy Levels
Table 8.24 Effect of Interest on Learning Enjoyment at Varying Self-Efficacy Levels
Table 9.1 Comparative Model Fit of the Baseline Model versus Group-Specific Strong Invariance
Table 9.2 Average Development for Course Design, Gender, and Proficiency across the Semester
Table 9.3 Model Fit of the Group-Specific Models under Weak Factorial Invariance
Table 9.4 Design-Specific Structural Relations between Quiz and Expectancy Factors

Table 9.5 Design-Specific Structural Relations between Quiz and Value Factors
Table 9.6 Design-Specific Structural Relations between Quiz and Course Emotions
Table 9.7 Design-Specific Structural Relations between Quiz and Learning Emotions
Table 9.8 Gender-Specific Structural Relations between Quiz and Expectancy Factors
Table 9.9 Gender-Specific Structural Relations between Quiz and Value Factors
Table 9.10 Gender-Specific Structural Relations between Quiz and Course Emotions
Table 9.11 Gender-Specific Structural Relations between Quiz and Learning Emotions
Table 9.12 Expertise-Specific Structural Relations between Quiz and Expectancy Factors
Table 9.13 Expertise-Specific Structural Relations between Quiz and Value Factors
Table 9.14 Expertise-Specific Structural Relations between Quiz and Course Emotions
Table 9.15 Expertise-Specific Structural Relations between Quiz and Learning Emotions
Table 9.16 Group-Specific Multiplicative Expectancy-Value Effect on Learning Enjoyment
Table 9.17 Design-Specific Quiz Effects according to Gender and Proficiency Level
Table 9.18 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Course Hopelessness
Table 9.19 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Learning Hopelessness
Table 9.20 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Course Enjoyment

1 Potentials of and Barriers to Teaching Statistics in Large Higher Education Lectures

1.1 The Incompatibility between the Relevance of Statistics for Personality Formation and Traditional Approaches to Teaching

Ever-increasing amounts of quantitative information in all spheres of life have rendered data literacy an indispensable 21st-century skill for understanding, interpreting, and communicating data (Martin et al., 2017, p. 455; Showalter, 2021). This development is well reflected in the presence of statistical modules in a large variety of degree courses, such as social sciences, economics¹, politics, medicine, sport science, geology, and psychology (González et al., 2016, p. 214; Martin et al., 2017, p. 455; Waples, 2016). Statistics often functions as a gatekeeper that may discourage students from persisting with the respective degree course (Ross et al., 2018). Failure in a statistics course is thus likely a common reason for university dropout, particularly between the first and second academic year (Nichols & Dawson, 2012, p. 469; Ross et al., 2018). Apart from college statistics modules, students are also required to explore theories and to scrutinize data-driven propositions in the course of their studies, particularly in their final theses. In professional life, statistical competencies are highly in-demand skills and an integral part of many social science- and economics-related job advertisements (von der Lippe & Kladroba, 2008; Showalter, 2021) as a requirement for making evidence-based managerial decisions. The growing importance and omnipresence of statistical concepts require beneficial learning environments that foster students’ statistical reasoning skills as early as possible to prevent students from turning away from statistics for good (Schau & Emmioglu, 2012, p. 86; Showalter, 2021).

¹ Zlatkin-Troitschanskaia et al. (2013), for instance, have shown that statistical and mathematical modules account for up to 20 % of the credit points of economics curricula at German universities and universities of applied sciences.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_1

The relevance of statistical education, however, contrasts with the frequently reported unfavorable cognitive, motivational, and emotional starting conditions in higher education statistics courses². From the cognitive perspective, beginning students were reported to have a variety of misconceptions about statistics (Garfield & Ben-Zvi, 2007). Empirical studies found that they apply statistical algorithms only formally and rigidly, without being able to scrutinize, intertwine, or transfer these concepts to unfamiliar contexts (Garfield & Ben-Zvi, 2007, p. 386). The investigated misconceptions could barely be remedied even by means of formal instruction throughout a semester (Garfield & Ben-Zvi, 2007; Hirsch & O’Donnell, 2001; Martin et al., 2017, p. 456; Zieffler et al., 2008). Apart from statistics competencies, non-cognitive factors have emerged as primary factors in understanding students’ statistics achievement and are generally related to the expectancies for success, the task value, and feelings towards the subject (Shahirah & Moi, 2019, p. 651). However, empirical findings indicate that students enter statistics courses with rather negative expectations (Förster & Maur, 2015; Garfield et al., 2014; Khavenson et al., 2012, p. 2126; Perepiczka et al., 2011; Niculescu et al., 2016, p. 290). In particular, research has documented that statistics-related attitudes are among the most salient predictors of learning outcomes in statistics courses (Macher et al., 2012; Finney & Schraw, 2003; Hanna et al., 2008; Emmioglu et al., 2018, p. 121). Students frequently choose their preferred field of study, such as social science degree courses, without being aware of the large amount of statistical content in the first semester (Garfield et al., 2014; Perepiczka et al., 2011; Waples, 2016).
Low expectations of success, low subjective task value, and statistics anxiety were shown to prevail in methodological courses and eventually to affect achievement motivation and performance (Erzen, 2017, p. 76; Förster & Maur, 2015; Garfield et al., 2014; Macher et al., 2013; Waples, 2016). These negative manifestations tempt students to postpone enrolling in statistics courses and thus jeopardize their degree attainment (Onwuegbuzie, 2004). Since most students take only one basic statistics course in their undergraduate course of study, the opportunities to make students more enthusiastic about statistics are limited, so that there is an acute need for action to rework statistics courses instructionally (Harackiewicz, Smith, et al., 2016, p. 221; Kerby & Wroughton, 2017, p. 476; Ramirez et al., 2012). Statistics-related attitudes have been documented to be recalcitrant to change by means of interventions (Schau & Emmioglu, 2012; Xu & Schau, 2021, p. 316).

² In chapter 1, motivation and emotion will be used as general terms to refer to the perceptions to be investigated in the present study. These perceptions will be further differentiated in sections 3.2 and 3.3.

The implementation of instructional measures to attenuate such unfavorable cognitive, motivational, and emotional manifestations is made more difficult by the heterogeneous preconditions with which beginning students enter introductory statistics courses. Student cohorts themselves become increasingly heterogeneous due to enrolments from diverse backgrounds (González et al., 2016, p. 214; Ross et al., 2018). Thus, students begin their first statistics course with mixed mathematical and statistical prior knowledge, attained from different educational paths, which has been shown to affect their overall motivation to attend the course and, eventually, their final exam performance (Ramirez et al., 2012). Particularly in the domain of statistics, prior knowledge and gender were shown to be highly relevant determinants of students’ motivational and emotional states of mind. Generally, female and less proficient students were shown to have lower statistics-related expectations for success and lower interest (Ramirez et al., 2012). PISA studies have shown that these preconceptions are rooted in secondary school mathematics (Goldman & Penner, 2016; OECD, 2015). Hence, heterogeneity in student characteristics carries over to their motivational, emotional, and cognitive learning outcomes as well. Because gender and prior knowledge are immutable correlates of unfavorable predispositions when entering a statistics course, instructors have an even greater need to find adequate instructional measures to counteract the disadvantages of specific student groups.
Addressing this heterogeneity is even more difficult in the teaching constellations of European public universities, in which introductory statistics courses of various degree courses are usually organized as large lectures held in tiered lecture halls with up to 700 students (Gannaway et al., 2017; Hood et al., 2021; Huxley et al., 2018). Such teaching arrangements offer very limited possibilities for adequately addressing students’ individual cognitive, motivational, and emotional needs (Hadie et al., 2019; McKenzie et al., 2013). In large lectures, the instructor delivers large amounts of prescribed curricular content while students, often unprepared, take notes without being expected to scrutinize whether they actually understand what has been said (Banfield & Wilkerson, 2014, p. 291). Such a passive way of knowledge transmission may further demotivate students (Haraldseid et al., 2015; Jacob et al., 2019, p. 1769; Vogel & Schwabe, 2016; Zhu et al., 2018) and at least fails to address, or even aggravates, the above-mentioned unfavorable learning predispositions towards statistics. More concretely, given the role of passive recipients without training and self-regulatory opportunities during the semester, students remain unprepared during the semester, lapse into “binge learning” or “cramming” shortly before the final exam, and recognize knowledge gaps only when it is already too late to correct them (Bälter et al., 2013, p. 234; Cook & Babon, 2017, p. 12; Loeffler et al., 2019; Takase et al., 2019). These short-term and discontinuous learning patterns hinder the activation of higher-order thinking processes and leave insufficient time to recognize, reflect on, and remediate existing misconceptions (Bälter et al., 2013; Gilboy et al., 2015; Schmidt et al., 2015; Walker et al., 2008). The lack of regular training and monitoring thus further affects knowledge acquisition in the long run because it favors inert and transient knowledge structures, which are easily forgotten (Hadie et al., 2019; Macher et al., 2012; Ramirez et al., 2012; Schmidt et al., 2015; Zimmerman & Moylan, 2009, p. 299). Under these conditions, gender- and expertise-related differences in large lectures have already been found to become larger over the course of a semester (Förster & Maur, 2015). In light of the depicted challenges and the presence of statistics in many different degree courses, due regard has to be given to effective and beneficial teaching. Hence, concrete remedial teaching methods applicable in large lectures are urgently required to thwart such negative learning trajectories. A number of studies have investigated to what extent motivational and emotional perceptions in statistics change over the course of a semester. However, little research has been conducted on whether and to what extent concrete instructional approaches could be related to such an attitudinal change, so that not much is known about how instructors can optimally transmit statistical knowledge in large lectures (Garfield & Ben-Zvi, 2007, p. 378).

1.2

Feedback and Flipped Teaching as Competence- and Autonomy-oriented Countermeasures to Improve Teaching Statistics

A possible approach to providing a large number of students with frequent self-regulatory information about their misconceptions and their current level of knowledge could lie in the implementation of formative, standardized feedback by means of electronic quizzes. Such electronic tools have recently become more firmly established in higher education as they accommodate the growing demand for flexibility of the more technophile student generation (Day et al., 2018; McKenzie et al., 2013; Murphy & Stewart, 2015). This likely stems from the fact that electronic quizzes can be used for time- and cost-efficient, simultaneous large group assessments and as a way to keep students on track throughout the semester
(Bälter et al., 2013, p. 23; Day et al., 2018). Another key advantage of such components lies in the capability to disseminate content to large groups of students. Moreover, several meta-analyses and studies have pointed to the positive effects of electronic feedback on learning outcomes (Azzi et al., 2014; Marden et al., 2013; McNulty et al., 2014; Salzmann, 2015). Instructor-centered large lectures in particular undermine millennial students’ desire for autonomy, choice, and ownership of competence because of their controlling style and the prevalent student-teacher power differential (Baker & Goodboy, 2019, p. 80; Harpe et al., 2012, p. 247). The flipped classroom is an innovative teaching approach that is assumed to address these particular needs by relocating knowledge transmission outside the classroom by means of pre-class activities (i.e., watching educational videos or reading assignments). Similar to electronic quizzes, digital learning content, such as videos and online simulations, can be made accessible to large groups of students (Chans & Castro, 2021). Students can access the content flexibly, as often as they need and at their preferred pace, time, and place, while the time in class is devoted to in-depth problem solving in groups mentored by the instructor. According to self-determination theory (SDT), more autonomy in flipped classrooms positively influences students’ motivation, emotions, engagement, and academic achievement (Combs & Onwuegbuzie, 2012; Ramirez et al., 2012, p. 63). Despite its potential, the teaching format is not sufficiently considered in university teaching practice due to its higher implementation effort and concerns about negative reactions from students who are accustomed to traditional teaching methods (Fidalgo-Blanco et al., 2017, p. 720; van Alten et al., 2019).

1.3

Prevalent Gaps in the Research on Improving the Teaching of Statistics by Means of Instructional Adaptations

In the context of statistics education, the impact of formative feedback and the flipped classroom on students’ motivational and emotional perceptions is largely under-researched. Regarding formative feedback, only two studies were found, both of which focused exclusively on cognitive outcomes (Lovett, 2001; Mevarech, 1983). The flipped classroom has recently been investigated in the context of statistics education (e.g., Gundlach et al., 2015; Nielsen et al., 2018; Pablo & Chance, 2018; Showalter, 2021). These studies, however, have different methodological shortcomings: students could self-select whether to participate in the flipped or traditional course, leading to unevenly distributed samples
and obscuring students’ motivational and emotional inclinations; the course characteristics of traditional versus flipped design were not sufficiently distinct (e.g., traditional classroom already featured intense group work; lack of educational videos in flipped classroom). Broadening the scope of research to other domains, the efficacy of both feedback and flipped classroom formats (and other student-centered instructional approaches) is mostly investigated using mean comparisons³ (for feedback: Gundlach et al., 2015; Pablo & Chance, 2018; for flipped classrooms: Day et al., 2018, p. 918; Lo & Hew, 2021; O’Flaherty & Phillips, 2015). The majority of these studies focused on comparing academic achievement in different groups (e.g., feedback/no feedback or flipped/traditional) and only occasionally included motivational and emotional variables as covariates. Accordingly, most of these studies analyzed data at the aggregate level only (Kher et al., 2013, p. 1816). As regards feedback, the predominance of comparative studies limits the available information on effect mechanisms between feedback and student perceptions (Day et al., 2018, p. 918; Farmus et al., 2020). As regards the course design, mean comparisons of different teaching contexts with various design features provide only limited insight into design-specific effect mechanisms, for instance between perceptions and achievement (Lo & Hew, 2021, p. 13). This is aggravated by the fact that the flipped classroom lacks a common conceptual basis, so that each study uses different implementation strategies (Bishop & Verleger, 2012; O’Flaherty & Phillips, 2015). Moreover, the majority of studies investigating formative feedback or intermediate assessments have sample sizes smaller than or around n = 100 (Kleij et al., 2015), which also applies to studies investigating the flipped classroom (Farmus et al., 2020; Lo & Hew, 2021).

³ As regards feedback, a smaller number of studies investigated the relationship of feedback with final assessment grades and other affective outcomes.

Apart from methodological desiderata, another limitation lies in the selection of variables and constructs in most studies on feedback and the flipped classroom. More concretely, motivational and emotional perceptions were incorporated rather fragmentarily (e.g., only a small selection of constructs) and unsystematically from insufficiently validated surveys (e.g., self-constructed items, faculty surveys, or other not well-established scales; Lo & Hew, 2021, p. 13). This also entails a lack of connections with comprehensive and influential educational frameworks that point to decisive perceptional variables and how they should interrelate with each other, the instructional conditions, and achievement (Combs & Onwuegbuzie, 2012, p. 351; González, 2016, p. 214; Morris et al.,
2021; Yang et al., 2021, p. 422). Pekrun’s control-value (CV) theory of achievement emotions (2006), for instance, provides a framework which postulates that instructional features impact expectancies for success and subjective value, which in turn influence emotional reactions and, eventually, achievement. The CV theory also assumes that achievement is reciprocally related to subsequent expectancies and values over time; it is thus reconcilable with the process of feedback reception and calls for longitudinal research designs. Given this context, a longitudinal cohort study would be suited for sequencing alternating cause-and-effect relationships between motivation, emotion, feedback, and achievement in a clear chronology without recall bias. Despite the relevance of a longitudinal consideration of the reciprocal learning process, most studies on the efficacy of feedback or the flipped classroom were cross-sectional and only assessed student appraisals at a single point in time in relation to an assessment event (Bishop & Verleger, 2012; Hew & Lo, 2018, p. 11; O’Flaherty & Phillips, 2015; Peterson et al., 2015, p. 82). Most empirical studies on intermediate assessments use unidirectional predictive designs and thus neglect the reciprocal nature of the linkages as postulated by Pekrun (2006). In the field of statistics education, some longitudinal studies exist that use latent structural equation modeling (SEM) with larger samples (e.g., Chiesi & Primi, 2010; Finney & Schraw, 2003; Macher et al., 2013; Schau & Emmigolu, 2012, p. 86; Tempelaar, Gijselaers, et al., 2007). However, such studies almost exclusively treated achievement-related variables as dependent on student perceptions. Hence, they focused on the direction of effect from perception to achievement while ignoring potential feedback effects that (prior) success and failure have in shaping students’ motivational and emotional perceptions (Jarrell et al., 2017, p.
1264; Yang et al., 2021, p. 422). Studies in the context of CV theory also used longitudinal SEM to analyze reciprocal effects (e.g., Clem et al., 2021; Niculescu et al., 2016; Pekrun et al., 2017). These studies focused on cross-lagged, inter-attitudinal effects between motivational and emotional perceptions. While these studies may have helped to understand the structure and etiology of these appraisals, their discussion sections resort to anecdotal solutions to obstacles in teaching practice, such as presumptively appealing activity-oriented methods (e.g., inclusion of small experiments, problem-oriented teaching), without empirical foundation (Ruggeri et al., 2008, p. 65). This is because most studies did not relate student perceptions and attitudes to instructional antecedents or externalities, and thus provided limited implications for designing motivationally and emotionally sound learning environments (Onwuegbuzie, 2004, p. 4; Pekrun & Linnenbrink-Garcia, 2012, p. 278; Tempelaar, van der Loeff, et al., 2007a, p. 81; Waples, 2016, p. 285). Hence, the research on statistics knowledge
acquisition is in a dilemma of having recognized a firm link between motivation, emotion, and achievement while not providing leverage points for concrete didactic measures to improve students’ motivational and emotional perceptions (Combs & Onwuegbuzie, 2012, p. 351; Garfield & Ben-Zvi, 2007, p. 389; Kher et al., 2013, p. 1816; Zieffler et al., 2008). The following list summarizes the three main desiderata in the respective research fields as mentioned above, which serve as a basis for deriving the research design and research questions in the following subchapter:

1. Construct selection and empirical testing unrelated to comprehensive educational frameworks.
2. Focus of research methods on mean comparisons or, to a smaller degree, unidirectional designs.
3. Limited longitudinal findings on the reciprocal effect mechanisms between specific instructional features, motivational and emotional appraisals, and (formative/summative) achievement in different course formats with large samples.

Hence, more evidence is necessary to disentangle the reciprocal interrelationships between feedback and students’ motivational and emotional appraisals.

1.4

Derived Research Objectives and Structure of the Thesis

To bridge the above-mentioned gaps in research, the main objective of this study was to investigate the reciprocal interrelations between expectancy-value (EV) and emotional appraisals, and standardized electronic feedback, in a longitudinal study based on the CV theory framework. The practical purpose of the study is to determine whether feedback fosters motivational and emotional perceptions throughout a complete semester in a large lecture. In order to account for potential addressee differences, the feedback-appraisal relationships will also be tested for moderating effects of the heterogeneity-generating variables gender, prior knowledge, and the instructional medium itself, i.e., flipped versus traditional classrooms⁴.

⁴ The data for the mentioned analyses were gathered from samples in the context of the project FLIPPS (“Fostering statistics-related learning processes in large higher education flipped classrooms”), funded by the Federal Ministry of Education and Research (grant number DHL1035) from 2017 to 2020.

To the best of the author’s knowledge, no study so far has investigated the reciprocal linkages
between feedback, achievement motivation, and emotion in alternation throughout a semester using longitudinal SEM based on the CV theory framework with thoroughly validated measurement instruments (Pekrun, 2006). The study thus seeks to provide advice on the integration of standardized formative assessments into university lectures and seminars so that students benefit in the best possible way. Therefore, feedback will be explicitly included in the model as a manifest variable to help explain concrete and actionable impacts in different contexts, leading to the following research questions (RQ):

RQ1: To what extent does feedback from electronic quizzes foster expectancies for success, subjective task values, and achievement emotions throughout a complete semester?

RQ2: To what extent does feedback-related motivational and emotional processing depend on gender, prior knowledge, and the instructional medium in which feedback is conveyed?

In a first step, the study investigates to what extent the quiz feedback interrelates reciprocally with students’ achievement motivation and achievement emotions in repeated measurements over the course of a semester. In a second step, the reciprocal relationships between feedback and appraisals will be investigated with regard to gender, prior knowledge (Marchand & Gutierrez, 2017; OECD, 2015; Ramirez et al., 2012), and course design (traditional/flipped) to either generalize the results from step 1 or to account for differential functioning of some feedback-appraisal relationships. The thesis is structured as follows: Chapters 2 to 4 constitute the theoretical framework, in which the theoretical model is successively built up. Chapter 2 sheds light on the assessment of the central formative and summative achievement variables, relating to statistical reasoning, in consideration of renowned statistics competence models (section 2.1). The assessment of both formative and summative achievement variables will be delineated as regards their coverage of thinking skills (section 2.1.1), of statistics content areas from other measurement instruments (section 2.1.2), and of core curricula from statistics textbooks (section 2.1.3). After a reconciliation of the prevalent assessment methods and contents with the specific conditions of the statistics course, implications for the selection of the measurement will be drawn (section 2.1.4). Section 2.2 factors common stumbling blocks in statistics knowledge acquisition into the decision making on the best possible assessment practice in large statistics lectures.
Having delineated the conceptual basis of contents and skills to be assessed, section 3.1 explains to what extent feedback is expected to contribute to statistical knowledge acquisition whilst considering feedback models with a behavioral focus. After having established the link between statistics knowledge acquisition and feedback, the assessment design characteristics and their potential moderating influence will be set in the context of the present study. Section 3.2 marks the shift from the behavioral to the motivational perspective of feedback reception. Based on the assumption that achievement motivation is an integral part of self-regulating feedback information, Zimmerman’s self-regulatory model and, most importantly, the EV model will be introduced as the theoretical basis for hypothesis generation. After a consolidation of the EV model within the study context, each constituent EV appraisal (e.g., self-efficacy, interest value, utility value, effort) will be related to feedback reception theoretically and empirically in due consideration of their reciprocal linkages over time to generate hypotheses H1–H5. In section 3.3, the EV model will be expanded by the perspective of achievement emotions according to the CV model and their relationships, which will be elaborated analogously to section 3.2, leading to hypotheses H6–H7. Before achievement motivation and emotions are integrated into one hypothesis model (3.6), multiplicative effects (3.4) and longitudinal development (3.5) of the appraisals will be considered to obtain a clear picture of the inner workings of the model. The final extension of the model is particularized in chapter 4 in such a way that the general relations between feedback and student appraisals will be made contingent on students’ gender, prior knowledge, and the course format in which feedback is delivered.
Based on theoretical and empirical research on gender-, expertise-, and design-specific differences in statistics and related academic fields, exploratory assumptions will be adapted to the reciprocal feedback-appraisal mechanisms. Chapters 5 to 9 mark the empirical part of the thesis. As a starting point for the latent autoregressive SEM, chapter 5 describes the courses, samples, measures, and their longitudinal arrangement along with relevant quality criteria according to the standards of the American Educational Research Association. To assure a sufficient quality of the measurement models before structural modeling, chapter 6 analyzes the constructs with regard to their reliability, validity, and model fit, with necessary adaptations being made in chapter 7. After having established measurement invariance across time as a necessary requirement for the longitudinal analysis, the generic structural feedback models will be set up in separate bundles (expectancy, value, emotions) to investigate the interrelations with feedback more closely (chapter 8). After the separate consideration, an integrative CV model, as close as possible to the final analytical model given
computational demands, will be established and investigated with regard to additive and multiplicative EV effects. Analogous to the theoretical part, the generic models will be particularized in chapter 9 to the moderator variables of gender, prior knowledge, and course design after group invariance testing. Section 9.5 concludes with a more detailed view of group-specific interaction effects and design-specific group effects. In chapter 10, recurring patterns from the structural analyses will be summarized and discussed with regard to implications for the implementation of formative assessments in university practice.

2

Knowledge Acquisition and Transmission in the Domain of Statistics

2.1

Particularization of Statistical Reasoning to the Study Context in Depth and Breadth

2.1.1

Hierarchical Models of Statistics Reasoning and its Transition from Computational to Conceptual Understanding

Statistical reasoning is intended to be fostered throughout the semester in alternation with students’ motivational and emotional development. Segmenting the complex process of statistics knowledge acquisition into different areas and stages serves to substantiate and delineate the areas and levels of statistical knowledge addressed within the context of this study. International research on theoretically and empirically derived developmental models of statistical knowledge surged in the first decade of the 21st century (Garfield et al., 2014). In general, the models were intended to structure statistical knowledge in terms of competency areas (domain and process models) and in terms of cognitive stages (hierarchical models).

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-41620-1_2.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_2

The most renowned statistics competency model was delineated by Garfield and Ben-Zvi (2007, p. 382) and differentiates between statistical literacy, reasoning, and thinking (LRT framework)¹. Statistical literacy refers to the ability to understand basic statistical terminology as well as to interpret, depict, and communicate data-based information and concepts² (Rumsey, 2002). Statistical reasoning refers to the ability to engage in depth with statistical concepts, to understand and explain statistical processes holistically, and to put them together in a meaningful overall context, thus referring to conceptual understanding (Ben-Zvi, 2018). Finally, statistical thinking is conceptualized as a deep understanding of when, how, and why to apply statistical procedures and assessments, entailing their theory-driven, context-adequate, well-planned, and reflective use in consideration of their limitations (Chance, 2002). While this model is based on a hierarchy that assumes literacy to be the basis for reasoning and thinking, with thinking as the highest stage, all three dimensions are terminologically rather indistinct. This reflects that researchers have reached no consensus in defining these terms (see also Garfield & Ben-Zvi, 2007; Martin et al., 2017, p. 455; Sabbag et al., 2018), which complicates the classification of statistical competency profiles or assessment tasks³. An alternative model of statistical reasoning from Huberty et al. (1993) is more concrete about the expected skills and activities and empirically determined three consecutive dimensions of statistical knowledge, differentiating between procedural knowledge involving the performance of arithmetic operations, propositional knowledge about the underlying statistical concepts, and, finally, conceptual understanding involving the ability to establish links between and reflect upon the conducted operations, concepts, and propositions.

¹ This competency model could also be considered a domain or process model because it has no strict hierarchy. Due to the absence of content-related concretizations and the focus on cognitive processes, it is subsumed under the hierarchical models.
² Concepts are abstract, universal mental representations which arise when learners integrate facts by drawing on existing schemata to construct meaning from new information to be flexibly transferred to other situations (Stern et al., 2017).
³ A follow-up study concurrently validating statistical literacy and reasoning by means of think-aloud interviews suggested that the best-fitting model for both factors is unidimensional (Sabbag et al., 2018).

Apart from routine applications, procedural skills can also relate to problem-solving as well as the knowledge of when and under which conditions to use the heuristics (Anderson & Krathwohl, 2001, p. 52). Broers elaborates that procedural knowledge is seen as a core skill that can basically be acquired by the mere application of mathematical or statistical operations and formulae (2002, p. 324), whereas
deeper understanding requires an awareness of statistical concepts and their interrelations. Based on the constructivist rationale, linking separate concepts and propositions with each other, or with already existing schemata, takes into account that learners do not start from a tabula rasa state of mind (Parr et al., 2019, p. 2). Hence, concept formation is deemed a maturation process lifting the knowledge to a higher stage of abstraction where it can be more easily retrieved (delMas, 2005, p. 81). Another model from Garfield (2002) takes up the development from procedural to propositional and conceptual understanding, but adds a preceding initial level of verbal argumentation, at which students can only verbally understand statistical concepts, but cannot actively apply them (Chan et al., 2016). Subsequently, in the stage of transitional and procedural reasoning, students can grasp one or more features of a statistical concept without linking them to one another (Chan & Ismail, 2014, p. 4338). At the level of integrated process reasoning, similar to the conceptual stage from Huberty et al. (1993), students holistically understand the process of statistical heuristics, can link them with each other, and have confidence in making predictions as well as explaining processes in their own words (Garfield, 2002). A shortcoming of this model is that it assumes knowledge acquisition to be linear even though some of the stages are strongly intertwined, so that thought processes are difficult to allocate clearly (Chance et al., 2005)⁴. Accordingly, videotape analyses and clinical interviews revealed that learners cannot be assigned to one particular level of reasoning (Chance et al., 2005, p. 308; Chan et al., 2016), suggesting an insufficient distinctiveness of these stages. Apart from that, only a few studies based on these five levels, or their further validation, have been conducted (Chan et al., 2016).

⁴ For instance, the difference between the transitional and procedural level is that one or two versus all relevant aspects of a statistical concept can be recognized. The subsequent level of integrated procedural argumentation additionally requires the ability to articulate the respective process. If learners, for instance, recall the one or two most relevant aspects and are able to articulate their choice, it is questionable whether they should be assigned to the transitional level or higher.

The Action-Process-Object-Schema (APOS) framework follows a similar understanding of developmental stages in statistical knowledge acquisition (Clark et al., 2007). The action level refers to the rigid application of statistical formulae (i.e., computing statistical measures without being able to understand their implications, limitations, etc.; Chance et al., 2005, p. 312). The subsequent process level entails the internalization of the concept as regards content, implying that it can be adapted and retrieved more easily for application in different contexts. The next higher level (object) requires a growing awareness and scrutiny of the applied statistical procedures in the overall context, leading to the highest
level (schema), which refers to the autonomous construction and intertwining of various statistical concepts to solve arithmetic problems. The aforementioned five-stage developmental model of statistical reasoning (Garfield, 2002) is structured similarly to the previously mentioned models in that the learner at first can only identify and define the relevant concepts in an isolated manner. In the higher developmental stages, the learner is able to put the concepts together in an overall context and to apply them flexibly in various stochastic contexts. The “Profile of Statistical Understanding” (Reading, 2002; Watson & Moritz, 2000) is based on the “Structure of Observed Learning Outcomes” (SOLO) model (Biggs & Collis, 1982) and further elaborates on the idea of interrelated knowledge from the aforementioned models for higher developmental stages. More concretely, the unistructural/transitional level implies that learners recall one single aspect (e.g., describing individual data points, computing single measures; Chan et al., 2016, p. 30), while the multistructural level involves the use of two or more concepts without necessarily integrating them with each other or fully grasping the context (e.g., normal distribution and population shape are correctly recalled in separation; Chan et al., 2016). These levels are thus similar to the transitional and procedural reasoning level (Garfield, 2002). Relational tasks in Reading’s model (2002) are similar to the skills required for integrated process reasoning (Garfield, 2002). Tasks or abilities at this abstract level are the most sophisticated because they require relating, comparing, generalizing, and integrating two or more concepts into a coherent whole (e.g., a larger sample size leads to lower variability and a more normal sampling distribution due to the Central Limit Theorem; comparison of mean and median and their impact on changing distributional shapes; Chan et al., 2016; Garfield, 2002, p. 2).
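
The relational example mentioned above (a larger sample size leads to lower variability of the sampling distribution) can be made concrete with a small simulation. The following Python sketch is only an illustration and is not taken from the study: the exponential population, the sample sizes, the number of replications, and the function names `sample_means` and `spread_by_sample_size` are assumptions chosen for demonstration purposes. It covers only the variability part of the example, not the shape of the distribution.

```python
import random
import statistics


def sample_means(n, reps=2000, seed=0):
    """Return the means of `reps` random samples of size `n`.

    Assumption for illustration: the population is exponential with
    rate 1 (mean 1, SD 1), i.e., clearly skewed.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


def spread_by_sample_size(sizes=(5, 30, 100)):
    """Map each sample size to the standard deviation of its sample means."""
    return {n: statistics.stdev(sample_means(n)) for n in sizes}


if __name__ == "__main__":
    for n, sd in spread_by_sample_size().items():
        # Theory: SD of the sample mean = population SD / sqrt(n) = 1 / sqrt(n)
        print(f"n = {n:>3}: simulated SD = {sd:.3f}, theoretical SD = {n ** -0.5:.3f}")
```

Running the script prints, for each assumed sample size, the simulated and the theoretical standard deviation of the sample mean. The simulated values should shrink roughly in proportion to one over the square root of the sample size, which is the kind of link between concepts that a relational task asks learners to explain.
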
From the national perspective, the scholastic standards for mathematics by the Standing Conference of Ministers of Education (KMK Bonn and Berlin, 2015) formulate content-related competences for the domain of mathematics, which can mostly be brought in line with the aforementioned stages of statistical reasoning. The strongest difference from most of the international models is that the competence framework places a stronger emphasis on verbal reasoning skills. Hence, mathematical argumentation skills entail the ability to understand mathematical-statistical propositions and to provide plausible and adequate substantiations (KMK Bonn and Berlin, 2015, p. 14). This competency level is meant for exclusively verbal argumentation and is thus comparable to Garfield’s verbal argumentation level (2002). Mathematical problem-solving skills refer to the identification of statistical problems, the determination of adequate solution heuristics, and the application of these procedures. The execution of these procedures involves routine or complex procedures with technical, formal, or symbolic
statistical entities (KMK Bonn and Berlin, 2015, p. 15). Statistical modelling skills imply that individuals can actively apply statistical concepts to real-world situations for structuring and simplification as well as interpret and scrutinize the constructed models. Finally, the ability to communicate statistically involves the appropriate selection of information (e.g., from journals, news media) and its integration into one’s own chain of reasoning for subsequent verbal or written communication. Most of the presented hierarchical models were formulated in relation to specific statistical topics, such as data analysis (Jones et al., 2001), sampling distributions (Garfield, 2002), averages (Watson & Moritz, 2000), or the central limit theorem and standard deviation (Clark et al., 2007), so that available descriptors are partly limited to these topics. Moreover, many of the models were based on or tested with competences of elementary or secondary school students (Reading, 2002; Watson & Moritz, 2000; Jones et al., 2001). Hence, some levels of statistical reasoning (e.g., idiosyncratic, pre-structural) might be applicable to university students only to a limited extent. Other hierarchical models had some overlap between the different stages (i.e., LRT framework and model of statistical reasoning), rendering a categorization of tasks and thought processes difficult. Table 2.1 provides an overview of differences and similarities between the presented hierarchical statistics models by summarizing comparable stages with their respective names and exemplary skills to identify core processes contributing to a common understanding of statistical knowledge acquisition processes. The intention of this assembly is to derive a sufficiently distinct, representative framework of the statistical knowledge acquisition process; to this end, the unifying and discriminatory aspects will be abstracted from each of the presented hierarchical models.
The conspectus of the different hierarchical models shows that, in large part, they share common aspects even though stages across the models were named differently or formulated more concretely. Most of the process models have in common that they start with a stage involving the more or less rigid application of statistical procedures (i.e., procedural, transitional, action level), which could be denoted as algorithmic skills. The next higher level usually refers to the meaningful application of the learnt procedures in relation to the propositional context (i.e., propositional level, process level). The national competence areas (KMK Bonn and Berlin, 2015) were more concrete in broadening the procedural level to consider problem-solving, i.e., the identification of a problem and the choice of an adequate solution procedure. All of the models have a highest level referring to conceptual understanding. This stage differs from the prior levels in that it is abstracted from more separately acquired procedural, factual knowledge so that learners grasp the interrelatedness among different concepts (Stern

Level 1: Algorithmic skills

Level 2: Declarative and procedural skills

Action: • Understanding is limited to the rigid application of statistical operations related to a certain concept and in response to explicit external cues

Process: • Internalization of an action into the individuals’ cognitive repertoire, such that it is perceived as being under mindful control; involves the ability to rethink and manipulate process steps of a certain procedure

Verbal reasoning: Transitional (and procedural) reasoning: • Verbal understanding • Understanding of one or more (or all) features of the process without fully linking them of statistical concepts together and without fully understanding the underlying process. without the necessity or ability of concrete arithmetical operations Clark et al. (2007): APOS framework

(continued)

Object: • Awareness of the holistic process, impact of manipulation and scrutinizing the operations and putting them in the overall context Schema: • Conscious or subconscious integration of a concept into the own cognitive framework in connection with other related entities

Integrated process reasoning • Students holistically understands statistical process, coordinate the rules, and predict its behavior.

Conceptual understanding: • Ability to integrate and connect facts and procedures into an overarching cognitive structure (i.e., variability, average, and shape) • Acquisition of meaningful knowledge for application in new, complex, less familiar situations • Reflecting on the operations, concepts at a higher stage of abstraction

Level 3: Conceptual understanding

2

Garfield (2002): Model of Statistical Reasoning

Procedural knowledge: Propositional knowledge: • Knowledge of how to perform calculations • Knowledge about (simpler) statistical facts • Application of statistical formulae and and propositions, statistical concepts, and arithmetic operations, manipulation of numbers their rationale related to the procedures • Understanding of the procedures not necessary • Knowledge of isolated propositions

Broers (2002); Huberty et al., (1993): Unnamed model

[Verbal reasoning]:

Table 2.1 Comparative Overview of Statistical Reasoning Models according to Aspiration Level

18 Knowledge Acquisition and Transmission in the Domain …

Level 1: Algorithmic skills

Problem-solving skills: • Application of mathematical procedures including problem identification and determination of adequate solutions in routine or complex situations

Level 2: Procedural skills

Level 1: Algorithmic skills

Level 2: Procedural skills

Handling notational or symbolic entities: • Application of mathematical-statistical operations and heuristics in routine or complex situations

Uni- (or multi)structural: • Application and understanding of one (or more) statistical concept(s) in isolation

Reading (2002): Profile of Statistical Understanding / SOLO taxonomy

Argumentation and communication skills: • Arguing adequately with statistical concepts • Appropriate selection of information and integration in the own chain of reasoning (verbal/written) [Verbal reasoning]:

KMK Bonn and Berlin (2015): Mathematical competence areas

[Verbal reasoning]:

Table 2.1 (continued)

(continued)

Relational level: • Relating, comparing or integration of two or more concepts into an overarching framework Extended abstract level: • Generalization and conceptualization at a higher abstract level

Level 3: Conceptual understanding

Modelling: • Transferring concepts to real-world situations, contextualization, interpretation, and reflection of the concepts used for the constructed models; construction and evaluation of complete models

Level 3: Conceptual understanding

2.1 Particularization of Statistical Reasoning to the Study Context in Depth … 19

Reasoning: • Reasoning with statistical concepts, interrelating concepts, combining ideas Thinking: • When, how, and why to apply statistical procedures and assessment according to theory, context, and critical reflection

2

Note. Source: Author’s own based on the above-mentioned works. [verbal reasoning]: This competency is not considered a part of the consecutive stages, but an independent skill with a stronger communicative focus relating to statistical “reasoning” in the narrower sense.

Literacy: • Understanding, using, interpreting depicting, data-based information and basic language and tools of statistics

Garfield & Ben-Zvi (2007): LRT framework

Table 2.1 (continued)

20 Knowledge Acquisition and Transmission in the Domain …

2.1 Particularization of Statistical Reasoning to the Study Context in Depth …

21

et al., 2017). Conceptual understanding is targeted when students are expected to systematize, transfer, and reflect on complex tasks in unfamiliar situations with recourse to meaningfully interconnected statistical concepts (Stern et al., 2017; Anderson & Krathwohl, 2001, p. 48). In order to locate and classify students’ level of statistical reasoning within the different frameworks, cognitive interviews were conducted in most studies to lay bare students’ current thought processes, such as in Broers (2002), Garfield (2002), Watson and Moritz (2000), Derry et al. (2000), and Clark et al. (2007). As the aim of the study is not to classify students’ cognitive knowledge level, but the longitudinal investigation of feedback and motivational-emotional development with a large sample, this methodology is not part of the present investigation. The purpose of the overall framework in the given context is to delineate the skill and competency levels addressed by the assessment tasks in the electronic quizzes and the final exam. Before the relevant facets of the competency framework will be delineated for the present study, the different statistical content areas, and existing standardized measures to assess them will be depicted and compared to the relevant national and institutional curricular of the assessed university course.

2.1.2 Content Models of Statistics Reasoning and their Measurement

Most of the hierarchical models mentioned above were derived qualitatively from cognitive interviews without offering ways to develop quantitative operationalizations for standardized assessments (Zieffler et al., 2008). Some statistics-related content models do include standardized test items, but most of them focus on specific content dimensions and only a few cover a wider range of content areas. Models and assessments addressing only a single content area relate, for instance, to bivariate and distributional understanding (with a focus on the correlation coefficient and histograms; Zieffler & Garfield, 2009), sampling distributions and variability (Garfield, 2002), t- and χ² tests as well as correlational analyses (Quilici & Mayer, 2002), probability theory (Hirsch & O’Donnell, 2001; Broers, 2002), and means and averages (Watson & Moritz, 2000), and comprise 5 to 20 standardized items.

Other content models structure statistics knowledge with a stronger emphasis on data exploration and data literacy. They usually relate the contents to a research process, starting with the definition of real-world problems and their statistical reformulation (i.e., transnumeration), followed by the application of the necessary statistical procedures and a final evaluation within the context of the initial problem (OECD, 2014, p. 42). Similarly, the Statistical Thinking Framework (Jones et al., 2001; Chan & Ismail, 2014, p. 4339)⁵ structures statistical analyses according to the domains of data collection and preparation, description, analysis, as well as interpretation and inference, with 39 items. This framework, however, neglects the steps prior to and following the handling of the gathered data. By contrast, the more holistic PPDAC (problem, plan, data, analysis, conclusion) cycle (Wild & Pfannkuch, 1999) also accounts for the initial problem definition and assessment planning up to the post-analytical conclusions leading to new research questions.

Regarding the national state of research, there is likewise no unanimously agreed-on content model for statistics education at tertiary institutions. For secondary schools, the KMK Bonn and Berlin provides a statistics-related content model within the domain of mathematics, with a strong focus on stochastics (2015, p. 21). The lack of content coverage is aggravated by the fact that the above-mentioned test instruments have mostly been tested only on smaller samples (< 200) and have not been subjected to any systematic follow-up validation.

In contrast to these specific solutions, a smaller number of broader-ranging models including operationalized test items exist, three of which are covered below. The Statistics Concept Inventory (SCI: Stone et al., 2003) includes 32 multiple-choice questions relating to the content domains of probability, descriptive statistics, and inferential statistics. However, some important topics, such as correlation and regression analyses, are neglected, and the covered content is based on curricula of university engineering courses. Most importantly, the items address conceptual understanding and neglect the second level of statistical reasoning (procedural and problem-solving ability).
The Statistical Reasoning Assessment (SRA: Garfield & Chance, 2000) uses 20 multiple-choice questions to distinguish between correct and incorrect statistical reasoning of freshmen and covers sampling theory, bivariate data, data representation, averages, and probability. Garfield and Ben-Zvi (2007) provide another test with 40 items, the “Comprehensive Assessment of Outcomes in a First Statistics Course” (CAOS), covering a broader range of contents, i.e., sampling distributions, bivariate data, inferential statistics, and probability, with a stronger focus on variability (Sabbag et al., 2018, p. 143). A few validation studies with larger samples of freshmen were conducted for the SCI, SRA, and CAOS; they, however, point to shortcomings in the measurement of statistical reasoning.

Content validity of the SCI was established by means of expert consultations with the engineering faculty, the advanced placement course outline, and textbook analyses (Allen et al., 2004, p. 2). The measurement instrument was thus based on the specifics of an engineering statistics course at one institution (Allen et al., 2004, p. 6; delMas et al., 2007, p. 29). Follow-up studies for the SCI accordingly suggest a lack of transferability to degree courses outside the engineering context. For instance, convergent validity between statistics course grades and the SCI score could only be established for engineering, but not for other disciplines (Allen et al., 2004, p. 4). Moreover, the Cronbach’s alpha value for the scale in the pilot study was only mediocre (α = .579; Stone et al., 2003) and highly variable when the test was administered at other tertiary institutions (Allen et al., 2004).

While domain experts reviewed a predefined set of items regarding the content validity of the SRA, it remains unclear to what extent curricular aspects were considered in the review process. Despite the broader range of topics, about two thirds of the items have a strong focus on probability (delMas et al., 2007, p. 29). Several studies found low internal consistency for either the aggregated items or the separate correct and incorrect reasoning scales, suggesting that the items do not warrant a meaningful interpretation of the aggregated values for statistical reasoning ability⁶ (Garfield & Chance, 2000, p. 18; Tempelaar, van der Loeff, et al., 2007, p. 91). Regarding criterion validity, correlating the SRA test score as one factor with students’ grade point average yielded only low to moderate correlations⁷ (Garfield & Chance, 2000, p. 18; Tempelaar, van der Loeff, et al., 2007, p. 83). The correlations between the SRA and attitudes towards statistics were also found to be low (Tempelaar, van der Loeff, et al., 2007), so that the instrument is probably inappropriate for the motivational and cognitive assessment framework of this study.

As regards content, CAOS was validated more extensively prior to distribution. The item pool was reviewed by 30 domain experts within four iterative pretests with regard to curricular completeness for introductory statistics courses as well as item formulation. The final test instrument, consisting of 40 multiple-choice items, yielded good Cronbach’s alpha values of around .80 (Garfield et al., 2007; Zieffler & Garfield, 2009, p. 16). Despite the very detailed description of the validation process in this study, it is again not clear whether and on the basis of which curricular yardstick these evaluations were made. Apart from this content validation, no further validation study has been conducted with the test instrument.

In sum, standardized instruments to assess statistical reasoning are mostly too narrow as regards content, while other, more extensive models lack a systematic investigation of several validity criteria (Zieffler et al., 2008). The fragmentary models are siloed solutions that do not spring from a common framework, rendering it difficult to combine them to assess statistical competencies in a broader sense as regards content (Zieffler & Garfield, 2009, p. 23). Another restriction of some assessments, such as the SRA, CAOS, and SCI, is that most of their items focus on conceptual understanding as well as statistical literacy, reasoning, and thinking (see Section 2.1.1) instead of the procedural skills primarily taught in the targeted statistics course. The fragmentary solutions available for the assessment of statistics reasoning necessitate a comparison with the statistics course to be investigated in order to evaluate the suitability of these instruments in the context of this study. Therefore, in the next step, several renowned statistics textbooks and the syllabus of the targeted introductory statistics course are consulted as a benchmark to assess the content coverage of the mostly topic-specific content models.

5 For instance, the difference between the transitional and procedural level is that one or two versus all relevant aspects of a statistical concept can be recognized. The subsequent level of integrated procedural argumentation additionally requires the ability to articulate the respective process. If learners, for instance, recall the one or two most relevant aspects and are able to articulate their choice, it is questionable whether they should be assigned to the transitional level or higher.

6 Tempelaar, van der Loeff, et al. (2007, p. 91) suggested that the low correlations may stem from mutually exclusive multiple-choice items and thus modeled an alternative seven-factor measurement model based on content relatedness, which yielded acceptable fit indices but has not been validated in follow-up studies.

7 Garfield and Chance (2000, p. 18) assume that the low correlations between the course grades and the SRA scale could depend on different assessment conditions. For instance, diligent students might resort to effort-based study techniques leading to surface learning, which might be favorable for bonus tasks included in course performance, but detrimental to momentary assessments of statistical reasoning (as shown by Tempelaar et al., 2006).
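For orientation, the internal-consistency coefficients cited above (α = .579 for the SCI pilot, values around .80 for CAOS) are Cronbach’s alpha values. The following is a minimal Python sketch of how such a coefficient is commonly computed from a respondents-by-items score matrix, using α = k/(k − 1) · (1 − Σ s_i² / s_t²), where k is the number of items, s_i² the variance of item i, and s_t² the variance of the total scores. The function name `cronbach_alpha` and the example data are hypothetical and are not taken from the cited studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one per respondent, each holding k item scores.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: four respondents, three dichotomously scored items.
example = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
print(round(cronbach_alpha(example), 3))
```

Because alpha rises with the number of items and with the average inter-item covariance, a low value for a short scale with heterogeneous content is consistent with the interpretation problems described above for aggregated reasoning scores.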

2.1.3 Comparison between a Preliminary Statistics Core Curriculum and the Content Covered in the Investigated Statistics Course

For a better contextualization of the statistical contents relevant for the introductory statistics course in economics (“Statistics I”), several statistics textbooks from the social and economic science disciplines were considered to identify core domains of nationwide importance. Table 2.2 depicts the recurring statistics topics from ten renowned statistics textbooks in the social and economics disciplines.

Table 2.2 Statistical Core Curriculum Based on Ten Textbooks in Comparison to the Investigated Statistics Course
[Table body not reproduced: coverage matrix of topics (rows) by textbooks 1–10 (see Appendix 1 in the electronic supplementary material), grouped by discipline (social sciences, economics), and by the course column C. Topics: statistical features; two-dimensional distributions; linear regression; probability functions; point estimation; interval estimation; random variables; hypotheses tests; statistical measures; combinatorics and probability; χ² test; t-test; F-test; analysis of variance; time-series analyses; price and value indices.]
Note. X = Topic was covered in a whole chapter of the respective textbook. C = topic is a part of the syllabus of the investigated “Statistics I” course. Topics that were covered in only one or two of the ten textbooks were not included.

The overview suggests that statistics topics are in large part homogeneous across the disciplines of economics, sociology, and political science. Core contents are, for instance, statistical measures, correlation, regression, combinatorics, and probability theory. Economics textbooks place a stronger focus on introducing basic statistics concepts (such as sample, population, variables, scales, distribution, frequencies), time-series analyses, and price/value indices, while social science textbooks more often consider bivariate distributions (i.e., scatterplots, statistical independence, contingency). Among the topics omitted from the table, social science textbooks have a stronger methodological focus (i.e., empirical designs, assessment methods, data collection and preparation), whereas economics textbooks deal with more specific challenges, such as heteroscedasticity, autocorrelation, and volatility. Apart from these minor deviations, the majority of the statistics contents are comparable across textbooks and disciplines.

The topic comparison also underlines that the “Statistics I” course targeted in this study deals with basic concepts of statistical analyses, descriptive statistics, bivariate data, regression, probability theory, time-series analyses, and indices. Topics with a stronger focus on inductive statistics, such as hypothesis tests, probability distributions, mean comparisons, and point and interval estimation, are covered in the consecutive course “Statistics II”⁸. When comparing the contents of the “Statistics I” course with the available measurement instruments, it becomes evident that several of them would have had to be combined to provide adequate coverage of the complete syllabus⁹. A combination of the different measurement instruments is deemed problematic because they were developed separately from each other and each has its own shortcomings regarding measurement and validity (see Section 2.1.2). Finally, such a combined assessment would conflict with practical considerations of the field study, such as the fact that answering additional items would consume more time within the sessions and burden students on top of the other course requirements throughout the semester. The lack of fit between the available measurement instruments and the investigated statistics course, along with the difficulty of adapting them to and implementing them in the field study, necessitates the use of assessments that are more specifically tailored to the statistics course, which are particularized below.

8 Therefore, the content models focusing on the data literacy process (Jones et al., 2001; OECD, 2014; Wild & Pfannkuch, 1999; see Section 2.1.2) will not be considered further in the scope of this study.

9 For instance, the BRA would have had to be used to assess “statistical measures”, the DRS for “two-dimensional distributions”, the SCI for “linear regression”, etc., while no instrument would have covered “time-series analyses” and “price and value indices”.

2.1.4 Implications from the Hierarchical and Content Models for the Measurement of Statistics Reasoning in the Present Study

Due to the lack of fit between the measurement instruments and the syllabus as regards content, the standardized formative and summative assessments used in the course (i.e., four electronic quizzes and one final electronic exam) are used as a proxy to assess students’ statistics-related knowledge acquisition processes¹⁰. The tasks are mostly located in the stage of procedural knowledge with a tendency towards the stage of conceptual understanding (see Section 2.1.1). Mere algorithmic application of formulae was not sufficient because the tasks were always introduced with a problem-oriented context (e.g., software outputs, visualizations, distributions) in relation to which the procedure had to be carried out. Notions of conceptual understanding were also addressed in some tasks because different concepts needed to be related to each other (e.g., variance to sample distribution, Gini index to measurement series, software output to regression analyses). However, the active construction of conceptual understanding and problem-solving skills within the scope of these assessments is limited by the restrictive format of the standardized questions. While correctly performed procedures might be taken as evidence that conceptual understanding has been acquired, they are only a necessary, not a sufficient, condition (Baroody & Ginsburg, 2013, p. 79). Procedural knowledge is easier to observe by means of standardized items than conceptual understanding because it relates to concrete and mostly linear operations, and it is therefore still commonplace in large higher-education courses (Baroody & Ginsburg, 2013, p. 79; Chans et al., 2021). Open-ended, unstructured problem-based tasks would be needed to foster these competency facets, but their implementation in lectures is impracticable, whereas standardized questions are more likely to ensure consistently higher participation rates and interpersonal comparability. In consideration of the demands of the assessment tasks, statistical reasoning can be understood as statistics-related procedural skills in the context of the present study.

10 Garfield adds for consideration that performance scores might not necessarily be a reliable indicator for statistical reasoning (2002). In the later analyses, reliability coefficients and validity aspects based on other variables (AERA, 2014) will be considered to evaluate the appropriateness of these scores.

2.2 Impediments to the Furtherance of Statistical Reasoning in the Context of Teaching and Potential Expedients

Challenges that hinder students from acquiring the above-depicted essential statistics-related procedural skills and conceptual understanding already begin with the wrong application of routine heuristics because concepts are only recalled fragmentarily or inaccurately (Chance et al., 2005, p. 296; Smith, 2008). Based on cognitive pre-post interviews with ten psychology students on their understanding of regression analyses, Broers found that wrong answers were mostly due to recently acquired knowledge that had not yet been firmly grounded (2002, p. 338). Moreover, only a few weeks after their exam, even students with the best grades were not able to flexibly use or conceptually understand basic statistical measures beyond their mere application (Clark et al., 2007). The great variety of heuristics and formulae, along with the different operating conditions and contexts, gives rise to a large number of heterogeneous misconceptions that are resistant to change after formal instruction and that even increase by the end of the semester (Garfield et al., 2007, p. 34; Hirsch & O’Donnell, 2001). Leaving misconceptions uncorrected hinders procedural and conceptual knowledge acquisition because the mistaken presuppositions interfere with newly acquired concepts and are compounded in subsequent lessons (Broers, 2002, p. 327; Cook & Babon, 2017, p. 12). Researchers determined widespread areas of statistical misconceptions, such as the inference of causality from correlation, or the inference of representativeness from a sampling distribution (Garfield & Ahlgren, 1988, p. 52). Studies suggest that misconceptions are often resilient and arise because naïve intuitions are cognitively more easily retrieved than the necessary specialist knowledge (Garfield, 1995, p. 27). Their correction is therefore difficult and takes time (Chan et al., 2016; Martin et al., 2017, p. 456; Zieffler & Garfield, 2008). These overt challenges gave rise to calls for improving pedagogical practice in ways that remedy misconceptions and the formation of inert knowledge (Garfield & Ahlgren, 1988, p. 46; Garfield & Ben-Zvi, 2007, p. 389). Therefore, researchers began to shift the lens from the mere categorization of statistical misconceptions to the investigation of beneficial instructional activities, methods, or other types of pedagogical interventions (Garfield & Ben-Zvi, 2007, p. 376).
For instance, more activity-oriented and collaborative course designs were shown to foster better understanding of the material compared to traditional and passive lectures, which may have been related to students constructing their knowledge more autonomously (Garfield & Ben-Zvi, 2007, p. 379; Banfield & Wilkerson, 2014, p. 291). Statistical reasoning was shown to benefit from regular hands-on tasks in varying contexts with the aid of verbal, visual, and practical cues (Broers, 2002, p. 339; Garfield & Ben-Zvi, 2007, p. 387). Several studies therefore investigated the use of illustrative computer simulations and technological tools that visualize datasets. Integrated knowledge can be fostered by allowing students to see the immediate impact of manipulations of statistics on dynamic graphs (Garfield & Ben-Zvi, 2007, p. 388). Moreover, computerized learning became interesting due to the opportunity to provide students with automated feedback (Cassady & Gridley, 2005, p. 8; Garfield, 2002). Chance et al., however, found that the autotelic interaction with software does not prevent most students from having difficulties in reasoning about sampling distributions (2005, p. 299). In a further pre-post approach, they first had students answer test items without preparation and then conduct the analyses in the software. Afterwards, students were to compare the insights from their active learning process with their prior, and partly wrong, answers. This explicit confrontation with the former misconceptions was more successful than before, leading to a statistically significant performance enhancement (Chance et al., 2005). In another comparison group design, Lovett found that working with statistical packages is only beneficial with additional feedback because otherwise students rely on their own abilities and might not recognize their own errors (2001, p. 46).

Another constructivist measure therefore lies in the explicit consideration of the above-mentioned misconceptions as a central element through which feedback helps to correct mistakes and thus minimizes the risk of acquiring misconceived knowledge (Conrad, 2020, p. 47; Enders et al., 2021, p. 93; Lipnevich & Panadero, 2021, p. 5). Feedback is argued to foster conceptual understanding by addressing errors or misconceptions immediately when they occur during practice and to help learners become co-owners of their learning process even in the anonymity of a large lecture (Morris et al., 2021, p. 3; Resnik & Dewaele, 2021, p. 23). To that effect, naïve preconceptions are not conceived as learning barriers, but are actively incorporated into the learning process to be revisited and refined in order to acquire more proficient reasoning. Linking newly acquired knowledge to previous suppositions makes it more easily accessible in the network of conceptual knowledge after being corrected (delMas, 2005, p. 81; Garfield & Ben-Zvi, 2007, p. 387; see Section 2.1.1). Approaches to confront and correct misconceptions are referred to as debiasing or interference, for instance (Broers, 2002, p. 327; Finney & Schraw, 2003, p. 182). Research findings thereby suggest that instructional, verbal persuasion or providing counterevidence had only limited success, whereas the implementation of pre-posttests with corrective feedback proved more successful in the context of statistical reasoning (first empirical evidence was found in Garfield & Ben-Zvi, 2007, p. 388; Mevarech, 1983). Empirical findings in statistics education and other domains have indicated that such cognitive perturbations contribute to significant improvement in performance because they help learners make sense of and overcome their misconceptions (Chance et al., 2004; Marzano et al., 2001; Mevarech, 1983; Moreno, 2004). Hence, activities should allow students to evaluate knowledge discrepancies by themselves to see the contradiction between their predictions and the expected result. For instance, multiple-choice questions with erroneous, but plausible distractors (“foils” or “lure items”) that create conceptual inconsistencies were shown to unveil and address common misconceptions more clearly and to potentiate learning by means of incorrect retrieval attempts (Enders et al., 2021, p. 94; Hirsch & O’Donnell, 2001; Martin et al., 2017, p. 456; Lipnevich & Panadero, 2021, p. 9). This finding is rooted in the concept of mindfulness (Bangert-Drowns et al., 1991, p. 217; Kulhavy & Stock, 1989), according to which learners mindfully process tasks, usually with some degree of certitude, even if guessing and intuitions are involved. Checking and refuting such erroneous assumptions were shown to result in deeper understanding and deeper reflection on these discrepancies (Wisniewski et al., 2020, p. 1).

The general set-up of public degree courses, in which statistics is usually implemented as a large lecture with a limited amount of credit points, however, renders the fulfilment of these instructional conditions highly challenging (Cai et al., 2018; Cleary, 2008, p. 154; Hood et al., 2021). Large lectures make it difficult to lay bare or address individual misconceptions of hundreds of students with different social, educational, and cultural backgrounds, let alone to take concrete and action-oriented individualized training measures to correct them. Most instructors resort to passively lecturing hundreds of students to get through the overloaded curriculum efficiently in the limited timeframe of one semester (Dehghan et al., 2022, p. 2512; Nadolny & Halabi, 2015). Students pass through the course anxiously because they receive neither training opportunities nor feedback throughout the semester until the final summative exam, so that misconceptions cannot be addressed in due time, thwarting conceptual understanding (Baloglu, 2003, p. 856; Banfield & Wilkerson, 2014, p. 291; Nadolny & Halabi, 2015; Pellegrino, 2010, p. 10). Summative information, however, ipso facto cannot contribute to students’ motivation in ongoing learning processes and may lead to fears of failure (Cai et al., 2018, p. 435).
The absence of frequent feedback mechanisms causes students to drag their misconceptions along uncorrected throughout the semester and to overestimate their self-efficacy to learn without seeing a need for improvement (Zimmerman, 2013, p. 145). The extensive consequence of neglecting time for early study-related activities ultimately is a worse grade, which has been attested in several studies (Self, 2013, p. 39). This passive way of knowledge transmission tempts students not to attend classes regularly, to spend time practicing incorrect skills, or to postpone learning until shortly before the final exam. The integration of technology is deemed to be one of the few trends that narrow the gap between instructional practice and self-determined learning in large lectures (Zimmerman, 2013, p. 145). On grounds of advancing technologies, Lai and Hwang state that the interaction between students’ motivation and learning behavior is frequently conveyed by automated technological tools (2016, p. 128). Learning systems and tools enable the provision of regular, timely, corrective feedback to motivate a large student group for early learning in large, prohibitive educational contexts (Bates & Galloway, 2012; Evans, 2013; Nadolny & Halabi, 2015). In the context of this study, electronic quizzes were used for the provision of automatic formative assessment throughout the semester due to their time- and resource-efficient implementation for large group assessments (Enders et al., 2021, p. 91; Kleij et al., 2015, p. 479). Computer instruction is likely to foster proactive learning as it offers easily and spontaneously accessible learning opportunities whose results can be stored to lay bare one’s strengths and deficiencies (Zimmerman, 2013, p. 145). However, only a few institutions consistently implement information and communication technology, and many lack a clear top-down institutional strategy (Bälter et al., 2013, p. 234), even though nowadays any institutional learning management system offers quiz modules that could be used without added expenses, with little effort, and with automated scoring (Day et al., 2018, p. 922). Moreover, only a few empirical studies in the domain of statistics education have been conducted or have documented performance-enhancing feedback effects (Chance et al., 2004; Day et al., 2018, p. 918; Lovett, 2001; Ward, 2004). Hence, theoretical and empirical relationships between feedback and knowledge acquisition will be considered in the next chapter.

3 A Model for Reciprocal Interrelations between Feedback and Statistics Reasoning

3.1 Formative Feedback and Academic Achievement

3.1.1 The Significance of Formative Feedback for Academic Achievement in Theoretical and Empirical Research

The above-depicted statistics procedural reasoning is operationalized formatively by means of electronic quizzes throughout the semester and by means of a summative exam at the end of the semester. Despite definitional vagueness discussed in various studies, the common denominator of formative feedback is that it communicates assessment information by an external entity to help students modify their thought processes to reduce knowledge gaps between their present state and the desired learning objectives (Cai et al., 2018, p. 435; Kleij et al., 2015, p. 476; Lee et al., 2020, p. 126). By means of this information and its timing, formative feedback ipso facto contains a motivational aspect that is expected to lead to increased effort in the scope of discrepancy-reducing strategies (Hattie & Clarke, 2019, p. 3). Due to the absence of a generally accepted model, some of the most influential attempts to characterize the performance-enhancing feedback reception process will be delineated starting with the initial models that emerged in the late 20th century and had a more behaviorist stance and were based upon cybernetic paradigms (Lipnevich & Panadero, 2021, p. 12; Seifried & Sembill, 2005, p. 657). Most of these models have a common basis in denoting feedback as information from an output creating a state of disequilibrium looped back into the system. Feedback was thereby seen as the main element that triggers a certain behavior (Panadero & Lipnevich, 2022, p. 9). The loop represents the learners’ interpretation on the outcome after internal monitoring (i.e., correct/incorrect, satisfying/unsatisfying), thus engaging learners to adapt their learning or actions

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_3


with the aim to minimize existing gaps (Lipnevich & Panadero, 2021, p. 8). Control theory, for instance, presumes that the disclosure of challenging discrepancies in knowledge motivates learners to reduce the gap between the current and necessary level of knowledge (Pritchard et al., 2013, p. 235). Carver and Scheier took up this approach by introducing the concepts of input function (i.e., perception, actual performance), reference value (i.e., goals, desired course performance), comparator, and output function (2000, p. 43), which are depicted in Figure 3.1.

[Figure 3.1: flow diagram. Elements: Reference (goal, desired outcome, e.g., best possible quiz score); Input (actual performance, e.g., actual quiz score); Comparator/feedback with the outcomes input > reference, input = reference, and input < reference; resulting output: reduce/maintain learning behavior or discrepancy-reducing behaviour; progress of discrepancy reduction.]

Figure 3.1 Carver and Scheier’s Feedback Model (2000). (Note. Source: Author’s own based on the verbal elaborations of Carver and Scheier (2000))
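
The comparator logic of this control model can be illustrated with a minimal Python sketch. It assumes, for illustration only, that quiz scores are plain numbers, that an input at or above the reference leads to maintenance of the current behavior, and that an input below the reference triggers discrepancy-reducing behavior, modelled here as a hypothetical increase in study effort. The names `LearnerState`, `comparator`, and `output_function` as well as the `step` parameter are illustrative assumptions and not part of Carver and Scheier's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerState:
    """Hypothetical learner state; 'effort' stands in for learning behavior."""
    effort: float


def comparator(input_score: float, reference_score: float) -> str:
    """Compare actual performance (input) with the goal (reference)."""
    if input_score < reference_score:
        return "reduce_discrepancy"  # a negative feedback loop is initiated
    return "maintain"  # input meets or surpasses the reference


def output_function(state: LearnerState, decision: str, step: float = 1.0) -> LearnerState:
    """Adjust behavior; raising effort by `step` is an assumption, not part of the model."""
    if decision == "reduce_discrepancy":
        return LearnerState(effort=state.effort + step)
    return state


# One pass through the loop: a quiz score of 6 against a goal of 10 points.
state = LearnerState(effort=2.0)
decision = comparator(input_score=6.0, reference_score=10.0)
state = output_function(state, decision)
print(decision, state.effort)
```

Repeating the pass with each new quiz score would correspond to the progress of discrepancy reduction shown in the figure. The sketch deliberately leaves out affective reactions and individual predispositions.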

The more traditional feedback models usually assumed that individuals strive for maintaining conformity between the reference and the input, implying that a negative feedback loop is initiated when there is a discrepancy between input and reference. Such a discrepancy provokes a change in the output behavior with the aim of reducing the discovered discrepancies (Carver & Scheier, 2000, p. 43; Onwuegbuzie, 2003, p. 1022) while conformity between input and reference (or if the input surpasses the reference) leads to maintenance of the prior behavior. An issue thereby might however be that a discrepancy remains ignored or unobserved preempting the need for adaptation (Carver & Scheier, 2000, p. 46). This feedback control model also parenthetically acknowledges that affective reactions result as a consequence from feedback processing, i.e., confidence when input and reference conform or doubt when they deviate (Carver & Scheier, 2000, p. 51). The extent of the emotional consequences is assumed to depend on the effectiveness of the discrepancy-reducing behavior. The model does however refrain from denoting concrete dispositions and effect mechanisms and construes appraisals only as consequences of output adjustments thus ignoring that they can also function as important preconditions for the uptake and effectiveness of the received


feedback (Cai et al., 2018, p. 434; Fong et al., 2018, p. 238; Panadero, 2017, p. 20).¹ Apart from this mechanistic stance, theoretical support for performance-enhancing feedback effects mostly emerged from social cognitive theory, according to which self-regulated learning through self-monitoring processes is guided by personal and environmental reference frames (Bandura, 2015, p. 20; Salzmann, 2015). Considering the processing of formative feedback and assessment for learning, feedback is essential evidence for students to scrutinize their ongoing learning outcomes and the achievement of the set goals to enhance learning (Karaman, 2021; Lee et al., 2020, p. 125). The provision of feedback, particularly in large and anonymous lectures without other mandatory intermediate knowledge tests, might be an important tool influencing the likelihood of whether knowledge gaps come to the fore and whether they will be corrected through self-reflection (Bandura, 2015, p. 21). Moreover, quiz feedback helps to clarify task- and course-related requirements, allowing students to set themselves challenging goals for the improvement of their performance (Pritchard et al., 2013). Particularly in large classes with only one final examination, providing regular feedback allows students to estimate whether they are performing well during the semester. Thus, it renders students more capable of purposefully adapting and gearing their efforts and learning behavior towards goal attainment (Latham & Arshoff, 2015; Pritchard et al., 2013). Moreover, according to Bandura’s observational learning theory, past success or even minimal progress fosters a self-reinforcing learning behavior and a motivational persistence to continue learning towards the acquired goal (2015, p. 23).
Regarding the empirical state of research, learning outcomes generally were found to improve through formative assessments in a recently updated version of Hattie’s visible learning study (Wisniewski et al., 2020) and other individual studies (Azzi et al., 2015; Bälter et al., 2013; Day et al., 2018; Jia et al., 2013; Karaman, 2021; Lee et al., 2020; Marden et al., 2013; McNulty et al., 2014). Moreover, feedback was shown to be an efficient means to correct misconceptions (Azorlosa, 2011; Lam et al., 2011; Pritchard et al., 2013; Salzmann, 2015). However, most studies investigating formative quiz effects focus on K-12 students (e.g., Karaman, 2021, p. 802; Lee et al., 2020) and the domains of medicine, second language acquisition, and some STEM subjects, while numerical subjects such as mathematics and statistics are underrepresented (Day et al., 2018, p. 911; Kleij et al., 2015, p. 502). Occasionally, no significant feedback effects were found (Demirer & Sahin, 2013; Kleij et al., 2012). The same mixed picture emerges from several meta-analyses, which suggest that feedback has a performance-enhancing effect in various domains. However, effect sizes vary considerably, which may be due to meta-analyses summarizing a large variety of types of formative feedback and suggests that not all feedback is equally effective (Adams et al., 2019, p. 317; Day et al., 2018, p. 909; Karaman, 2021, p. 802; Wisniewski et al., 2020, p. 12). Therefore, in the next section, specific features of formative feedback and their efficacy will be investigated and particularized to the context of the present study.

¹ Due to these restrictions, the mechanistic functioning of feedback control will be extended in section 3.2.

3.1.2 Design Characteristics Moderating the Feedback-Achievement Relationship

While formative feedback in general tends to have a positive impact on achievement, there are various design criteria used in different studies which might influence the performance-related effects. Two general categories of criteria for the feedback provision are (1) depth and involvement as well as (2) timing. Table 3.1 provides an overview of the different criteria with an indication of which criteria apply to the quizzes (formative feedback) and the exam (summative feedback). Apart from these variations, exam and quizzes were similar in format (i.e., electronic test) and difficulty level to enhance the potential training effects.

Table 3.1 Properties of the Formative and Summative Assessments of the Investigated Statistics Course

Category: Timing
  Criterion: Synchronicity
    • Synchronous [E]
    • Asynchronous [Q]
  Criterion: Number of assessments
    • Discretionary (four quizzes, one exam)
  Criterion: Intervals
    • Regularly [Q]
    • End-only [E]
  Criterion: Time limit
    • Untimed [Q]
    • 60 minutes [E]

Category: Depth and involvement
  Criterion: Feedback mediator
    • Self
    • Teacher
    • Computer [Q, E]
  Criterion: Elaborateness
    • Knowledge of result [E]
    • Knowledge of correct result [Q]
    • Elaborate feedback
  Criterion: Level of obligation
    • Voluntary
    • Mandatory [Q, E]
  Criterion: Reward
    • None
    • Contributes to credit [Q]
    • Contributes to course grade [E]
  Criterion: Stakes
    • Low [Q]: uncontrolled processing, open book, no time limit
    • High [E]: counts towards GPA, proctored, closed-book assessment, time limit (60 minutes)

Note. Source: Author’s own; E/Q = indicated assessment property applies to the exam or quiz of the investigated statistics course

Starting with the timing criteria, feedback provision generally can be synchronous (e.g., live quiz, voting, tutorial) or asynchronous (e.g., electronic quiz, self-test). In the context of our study, the quizzes were to be processed asynchronously as a distance learning tool in an LMS (Dabbagh & Bannan-Ritland, 2005). This was because asynchronous quiz questions can be designed more elaborately to trigger a deeper and longer engagement with the relevant content for the formation of a holistic knowledge structure. Moreover, data storage and the capture of further control variables are not possible in most live voting tools. The number of intermediate feedback opportunities provided throughout a semester was found to range from one to ten in most studies on formative assessment (Day et al., 2018, p. 918). In a few studies, a positive relationship was reported between test frequency and final learning outcomes (e.g., Bangert-Drowns et al., 1991; Lam et al., 2011; Zimmerman, 2000, p. 88). However, the determination of an optimum number of quizzes is complicated by the different underlying timespans of the studies, which make a comparison of the assessment density difficult. A common denominator, however, is that continuous assessments related more positively to course grades and reduced the probability of dropping out of the course compared to traditional end-of-term assessments (Day et al., 2018, p. 919; Panadero & Lipnevich, 2022, p. 14). The meaningful alignment with the course content and even distribution of the different assessments (e.g., knowledge review of topics dealt with in the prior lecture) seems to be more


important than the concrete number of assessments (Day et al., 2018, p. 918; Wisniewski et al., 2020, p. 12), which leads to the points in time at which feedback is provided. Instructional measures should take place repeatedly over a longer period of time to allow for a timely discovery of prevalent misconceptions and to offer sufficient opportunities for the consolidation of the appropriate concepts (Chance et al., 2005, p. 315; Garfield & Ben-Zvi, 2007, p. 389; Pellegrino, 2010, p. 8; Zieffler et al., 2008). Comparing two feedback conditions (immediate and end-only), Lovett found that immediate feedback enabled students to perform better on difficult tasks in the interim compared to students receiving end-only feedback (2001). In a longitudinal study, Zieffler and Garfield (2009) found that the development of statistical reasoning about bivariate data was quadratic, with most of the growth happening before the formal study of this concept in class. Wrong answers were often found to result from missing pieces of prerequisite knowledge (Chance et al., 2005, p. 299). Since many statistical topics build on one another, students run the risk of falling behind schedule when even the basic concepts are acquired incorrectly. Some studies indicate a decaying effect of feedback in such a way that it is particularly effective when administered early in the course (Bälter et al., 2013; Evans, 2013; Self, 2013, p. 37), which coincides with findings on statistical reasoning indicating that students’ knowledge development is most progressive at the beginning of the semester (Zieffler & Garfield, 2009). These findings suggest that students are most susceptible to knowledge-enhancing treatments starting from the beginning of the semester, so that early recourse to and, if necessary, correction of preexisting intuitions is indispensable for them to function as a productive resource in the knowledge acquisition process (Pellegrino, 2010, p. 10).
Therefore, regular feedback opportunities by means of four quizzes, starting from the third week of the semester, were given in the present study instead of summative end-only assessments to allow for constructive learning (see section 2.1.1; Day et al., 2018, p. 908; Gikandi et al., 2011). Concerning the intervals, in a systematic review of 33 studies on K-12 education, Lee et al. found that formative assessment was more effective with medium cycle lengths (i.e., intervals between instructional units) than with short cycle lengths (i.e., intervals between lessons) (2020, p. 140). Each quiz assessed the topics that were dealt with in the preceding three to four lecture units, so that the timing of the quizzes contributed to timely retrieval of recently acquired knowledge. The quizzes were to be processed by the students at their own convenience within a one-week timeframe.


Apart from timing-related criteria, feedback can also vary in terms of depth and involvement. Self-initiated feedback, involving metacognition and self-assessment of one’s own learning progress, and feedback from the teacher were documented to have larger effect sizes on performance compared to computer-mediated, non-judgmental feedback, which however still has a small to medium and significant effect size (Enders et al., 2021, p. 93; Karaman, 2021, p. 802; Kleij et al., 2015, p. 479; Lee et al., 2020, p. 125). As mentioned at the end of section 2.2, the large lectures present in today’s educational systems render personal (or even one-to-one) feedback from the instructor unfeasible (Kleij et al., 2015, p. 479). Hence, to provide task-related information on the learning progress, computer-based assessment is used to provide timely and individual feedback to a large number of students with higher test efficiency, which at least comes close to the advantages of personal tutoring under the given restrictions (Kleij et al., 2015, p. 479). Moreover, feedback can be provided to different extents, ranging from knowledge of results (right or wrong, percentage correct) and knowledge of correct results to more elaborated knowledge of correct results (i.e., including explanations; Shute, 2008, p. 160). In general, in the updated meta-analysis of Hattie’s visible learning study, it was found that feedback that includes the correct answer, and high-information feedback in particular, was most effective for knowledge enhancement (this is supported by further studies from Enders et al., 2021, and Wisniewski et al., 2020). A comparative study by Hooshyar et al. (2016) indicated that elaborated feedback was most effective in enhancing performance. However, they only compared it to the simplest type of feedback (binary) and did not specify the concrete elaboration of the feedback.
In a literature review on formative assessment studies, it was found that elaborated content feedback did not seem to be considerably more beneficial for performance compared to information about the correct result only (Day et al., 2018, p. 918). By comparison, the study review by Lee et al. found that formal feedback (i.e., written) was more effective than informal (i.e., oral) feedback (2020, p. 140). In their meta-analysis on computer-based formative feedback, Kleij et al. found that elaborated feedback was more effective than providing only the correct answer, while merely informing about correctness itself had no significant effect (2015, p. 781), which conforms to the findings of Bangert-Drowns et al. (1991, p. 228). In the context of the present study, feedback only informed about the given and the correct answer, which was documented in the LMS. Students could see their results and the correct answers only after the final submission to prevent “strategic quiz taking” (Kibble, 2007, p. 259), whereby students could have looked


ahead at the answer to reconsider their response. It also minimized the risk of surface learning through simple memorization of the quiz answers (Day et al., 2018, p. 922). The level of obligation was found to be evenly divided between mandatory and voluntary formative feedback, with no consistent differences in course grades depending on the degree of obligation (Day et al., 2018, p. 920). However, Kibble et al. (2011) report that students often refrain from using feedback opportunities, so that potential selection effects (e.g., based on prior knowledge) should be considered if participation is not enforced. Another variation lies in additional rewards incentivizing quiz participation, e.g., by means of bonus points for the exam or fulfilling course credit (Azzi et al., 2015, p. 418; Day et al., 2018, p. 924). In the present study, additional incentives were omitted to avoid unnatural behavior, such as cheating or extensive consultation of study materials to secure bonus points for a better grade (Kibble, 2007; Marden et al., 2013). As Day et al. put it, higher rewards would contradict the truly formative character because learners would focus on the means-end reward rather than deriving a need for action from their unveiled knowledge gaps (2018, p. 921; also: Butler, 1988; Deci et al., 1999, p. 653). This would also entail a shift from the constructivist to a more behaviorist view of learning (Lee et al., 2020), which would not be in harmony with the more conceptual understanding of statistical reasoning (as delineated in section 2.1.4). Accordingly, the quizzes were low-stakes and needed to be processed only as a requirement for exam admission. The quiz score itself was not a relevant criterion for admission or the final course grade. The minimum “reward” of exam admission was to ensure sufficient participation rates while not greatly affecting students’ natural motivation through higher incentivization.
Only the high-stakes final exam was a synchronous summative electronic test conducted in on-campus computer rooms according to the institutional examination regulations and contributed to the bachelor GPA. In sum, varying effects on achievement depending on the type of feedback (Karaman, 2021; Lee et al., 2020; Kleij et al., 2015) were documented. The conditions for the present study were determined under consideration of efficiency and feasibility for large lectures (i.e., regular feedback, knowledge of correct result, mandatory without interfering much with the natural learning behavior). After reviewing feedback research over several decades, Panadero and Lipnevich (2022) concluded that the implementation of feedback, even when the above-mentioned design characteristics are considered, can be ineffective. They ascribe this to the fact that up until the late 20th century, research failed to factor in students’ differences in individual preconditions and predispositions that influence the actual processing of the feedback. In that regard, most models


until then focused on the role of feedback as a mere act of transmission followed by a mechanistic adaptation of learning behavior towards a performance in line with desired or prescribed expectations, which was derived from the behavioristic law of effect (Lipnevich & Panadero, 2021, p. 6; Morris et al., 2021, p. 5). However, the models ignore that individuals are not only receptacles of feedback, but co-constructors of the uptake contingent on their motivation and other individual characteristics (Cai et al., 2018, p. 435; Lipnevich et al., 2016, p. 171). Moreover, these models took a more organizational stance on feedback by giving binary information on the repeated demonstration of certain skills that have already been learnt. In educational settings, task processing is more complex as it aims at developing new skills by means of more or less structured situations (Lipnevich et al., 2016, p. 171). This is why the conventional, feedback-centered models made way for more student-centered models in which feedback is seen as one element in the internal processing during task performance (Panadero & Lipnevich, 2022, p. 9). In order to situate such learner-related characteristics and the educational context in the center of the theoretical model, the terminology of statistical reasoning will be extended and adapted, using the previously explained models as a jumping-off point (Lipnevich et al., 2016, p. 171; Narciss, 2008, p. 137).

3.2 Formative Feedback and Achievement Motivation

3.2.1 Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Motivation in the Uptake of Feedback

After having elaborated on statistics procedural reasoning (formative and summative), the assessment framework will be broadened to include a motivational component. The inclusion of attitudinal and motivational elements in some of the models presented in section 2.1.1 already suggests the prominent role of learning dispositions in the context of statistics education. For instance, Chance et al. (2005, p. 309) identified “confidence—the degree of certainty in choices” as a relevant component for statistical reasoning behavior. Making “correct predictions with confidence” is also mentioned in the highest level of Garfield’s model of statistical reasoning (2002, p. 9). The four-dimensional framework for statistical thinking (Wild & Pfannkuch, 1999) also devotes one dimension to learning dispositions, which, e.g., entail engagement, curiosity, and perseverance. Both


authors also note that dispositions might be just as difficult to teach as statistical reasoning itself (Wild & Pfannkuch, 1999, p. 235) because they are innate characteristics depending on personal affinity and inclinations. This is particularly relevant for introductory statistics courses, which are often perceived to be an unpopular, angst-inducing subject in many degree courses (Macher et al., 2013). The cerebral cortex is involved in all higher-level processes of reasoning, emotion, and affect, so that Seifried and Sembill (2005, p. 657) advocate that inferences on learning should be based on an integrative view of these inseparable components. Weinert’s multidimensional conceptualization of competencies serves as a basis to investigate achievement motivation and emotion (AME) appraisals as desirable learning outcomes² (2001). Following his understanding, statistical reasoning can be broadened to entail the cognitive skills and abilities, as well as the associated AME appraisals, that individuals use or draw on when solving statistical problems (Richter et al., 2015, p. 97; Weinert et al., 2011, p. 70). Hereinafter, statistics reasoning is understood to encompass these motivational-emotional components, which will be referred to as AME appraisals towards statistics. This concept fits the given study context for several reasons. First, the extent to which AME appraisals are present can be made empirically observable in academic performance contexts (i.e., while learning statistics on- and off-campus). Second, construing a competence as skills and appraisals entails that they are the result of cognitive and affective learning experiences that were accumulated over time. These components are drawn on when solving problems and can thereby also be further developed. This developmental understanding of cognitive, motivational, and emotional competence components and their longitudinal interactions (Händel et al., 2013, p. 182; Weinert et al., 2011, p. 70) fully complies with the structural models and the distinct assessment of cognitive and noncognitive competencies in this study (see section 5.2.3). Third, cognitive competence is understood as coping with and solving domain-specific problems, rendering the competence model transferable to any conceivable domain. Finally, besides the relevant cognitive skills and abilities in the scope of statistical reasoning that have already been delineated in section 2.1.4, attitudes towards statistics, entailing AME appraisals, were consistently shown to play a central role in statistics education (Frenzel et al., 2007; Macher et al., 2012; Richter et al., 2015, p. 97;

² In their model of professional occupational competence, Baumert and Kunter (2006) also factor in such beliefs, values, and attitudes as constructs that are assumed to interact closely with the cognitive knowledge component.


Tempelaar, van der Loeff, et al., 2007). Therefore, the extension of the competency terminology is necessary for the purpose of the current study, so that feedback models specifically include AME appraisals (Lipnevich & Panadero, 2021, p. 7).

3.2.2 Feedback Models Incorporating Notions of Achievement Motivation

Feedback models that started to emerge along with cognitive-constructivist theories (Lipnevich & Panadero, 2021, p. 5) no longer regarded feedback as a monolithic driving force of learning that automatically triggers certain behavior, but as information that triggers internal self-reflection processes, which in turn motivate knowledge construction. This goes hand in hand with the view that the activation of an action scheme is a result of personal evaluation processes according to one’s own basic needs to be fulfilled (Seifried & Sembill, 2005, p. 657). It also reflects the growing awareness that skills and learning environments are malleable and can be influenced to foster knowledge acquisition (Clark & Zimmerman, 2014). Panadero and Lipnevich (2022, p. 12) determined that the internal feedback processing under consideration of the underlying individual characteristics as well as AME appraisals are crucial mediators of the feedback-performance relationship on which students will vary. In accordance with the constructivist view, the learners play the central role during feedback reception as they actively process and react emotionally as well as cognitively under consideration of individual predispositions and instructional factors (Panadero & Lipnevich, 2022, p. 9). In the context of statistics education, some structural models have been developed to represent motivational trajectories throughout a semester (e.g., Budé et al., 2007; Wisenbaker & Scott, 1997). However, most of them only assessed pre- and post-test attitudes as well as post-test achievement, so that the number of significant effects remained limited and did not account for learning processes happening in the middle of the semester (Schau, 2003, p. 33).
Moreover, these studies rather focused on analyzing the interrelations between attitude components themselves but did not consider or measure potential sources of the ascertained changes in their interrelations (such as instructional characteristics). Following the recommendation of Lipnevich and Panadero and in conformity with the research question, only meta-analytically grounded, interactional models focusing on feedback receptivity will be considered for further analyses as they focus on how learners with different predispositions receive and react differently


during feedback processing (2021, p. 26). Relevant assertions and findings of these models will be compiled in the chronology of the reception process in order to situate them in a more fully differentiated analysis model that takes into account the recursive longitudinal effect mechanisms between skills, appraisals, and feedback.

Bangert-Drowns et al.’s five-stage model of learning

The model of Bangert-Drowns et al. (1991) was one of the first student-centered models that fully acknowledges that learners mindfully, rather than automatically, process activities (Panadero & Lipnevich, 2022, p. 9). Instead, learners incorporate prior knowledge, intuitions, and feelings of certitude into their learning activity from the outset, which are then confirmed or disconfirmed (Bangert-Drowns et al., 1991, p. 128). Hence, different from the models mentioned in section 3.1.1, the model considers cognitive and AME appraisals not only as a reaction to feedback, but also as predispositions that flow into the feedback reception process (Bangert-Drowns et al., 1991, p. 231). Feedback loops are thereby initiated in which the given answers are evaluated against the solution (Bangert-Drowns et al., 1991; Carver & Scheier, 2001). If the comparison yields a discrepancy, the learning behavior has to be modified to reduce this deviation (Lipnevich & Panadero, 2021, p. 7). Bangert-Drowns et al. (1991) bring the process full circle as the insights from the evaluated feedback set a newly adjusted cognitive-motivational state for the next feedback loop. In that way, students are more likely to become aware of their misconceptions about statistics because they are explicitly incorporated into the iterative learning process to contribute to a better consolidated knowledge network (see section 2.1.1). The mediation between prior intuitions and newly acquired insights contributes to deeper, performance-enhancing learning approaches (Bälter et al., 2013, p. 234).

Zimmerman’s triadic model of self-regulated learning (SRL)

Zimmerman’s socio-cognitively shaped model of SRL provides further schematization by categorizing the three main determinants of feedback processing, which were only loosely defined in the five-stage model of Bangert-Drowns et al. Zimmerman and Clark thereby also ascribe the central role of feedback reception to the learner (“self”), who behaviorally, cognitively, and emotionally regulates the information to be processed (2014; Panadero, 2017, p. 19). Figure 3.2 illustrates the reciprocal triadic self-regulation loop with its three determinants of self-regulatory processes.

3.2 Formative Feedback and Achievement Motivation

Figure 3.2 Zimmerman’s Model of Self-Regulated Learning (2013). (Note. Source: Author’s own based on Zimmerman & Moylan, 2009, p. 300)

Personal (self) determinants refer to internal, covert processes that individuals use for self-regulation, such as drawing on their prior knowledge, self-reactions, self-motivation, affective-emotional beliefs, or goal setting, which together make up the forethought phase. In this phase, learners analyze the upcoming task and plan it strategically to keep themselves motivated. As the forethought phase is anticipatory by nature, its performance relevance is determined largely by the individual manifestations of self-efficacy and value appraisals as key sources of self-motivation (Zimmerman & Moylan, 2009, p. 301). These reflections are evoked or adapted by means of enactive feedback and translated into action by means of strategy use. The behavioral and environmental determinants refer to the actual phase of performance within particular educational settings. Unlike the mechanistic models (section 3.1), such interactional models consider that learners have to process feedback internally on every occasion by juxtaposing their current state of performance with their desired profile (Panadero, 2017, p. 20; Zimmerman & Clark, 1990, p. 372). In the self-reflection phase, this comparison produces internal feedback reactions at the cognitive, behavioral, motivational, and emotional level during self-evaluation, which likewise imply differential reprocessing of the feedback in the next feedback loop (Lipnevich & Panadero, 2021, p. 14; Zimmerman & Moylan, 2009, p. 300). The associated appraisals and beliefs are considered important determinants of the motivation to persist in a certain learning behavior. Personal self-regulatory processes are informed by the recognition of the extent to which tasks have been solved adequately beforehand, i.e., successful mastery experience, leading to an internal adaptation of attitudes, intentions, and beliefs in the next iteration at the “self” level (Clark & Zimmerman, 2014, p. 487). Self-reflection is marked by either adaptive or defensive decisions (Zimmerman & Moylan, 2009, p. 304). For instance, if an individual receives satisfactory feedback from processing a task, this likely leads to an adaptive increase in self-efficacy and a reduction or maintenance of effort in further cycles (Kluger & DeNisi, 1996, p. 260). Feeling more confident in dealing with the respective topic in turn positively influences the learning behavior and confirms the effectiveness of prior learning strategies (Clark & Zimmerman, 2014, p. 489). A positive self-evaluation likely enhances subsequent motivational beliefs and efforts to learn (Zimmerman & Moylan, 2009, p. 304). If the prior feedback was unsatisfactory, the learner can react either adaptively, i.e., by deciding to invest more effort to catch up on what was missed, or defensively, i.e., by avoiding further effort due to dissatisfaction (Kluger & DeNisi, 1996, p. 263). Regarding the adaptive strategies, the learner might be prompted to realize a need for optimizing their learning behavior (Zimmerman, 2000, p. 88), e.g., by rehearsing certain aspects, or to increase effort to attain the standard (Kluger & DeNisi, 1996, p. 260). Avoidance strategies, on the other hand, are accompanied by feelings of helplessness and disengagement (Zimmerman & Moylan, 2009, p. 304). It should be noted that the extent to which students’ prior expectations have been refuted affects subsequent motivation and outcomes (Zimmerman & Moylan, 2009, p. 304). Hence, the self-reflection phase is cyclically linked to the subsequent forethought phase in the pursuit of further proficiency.
Similar to the five-stage cyclical model and unlike Carver and Scheier’s feedback model (2000), attitudes are more prominently considered as processual, behavior-shaping determinants rather than only as reactions to external stimuli. Although the model provides good indications for including feedback in learning processes to foster self-regulatory behavior, a behavioristic stance resonates with it, as it originally came from health education. More concretely, the environmental dimension is assumed to be physically manipulable (Zimmerman, 2013, p. 137; e.g., opening the window when short of breath). For learning-related contexts, the environmental setting seems rather moot and coincides with learning behavior itself. No relevant constructs are concretely denoted for each of the three determinants, and more distal factors that influence learning processes, such as sociocultural characteristics, were not considered. Nevertheless, the reciprocal loop process between the clearly defined entities of feedback, self, and behavior helps to conceptualize the iterative longitudinal alignment of the analysis model more explicitly given the context of the present study. Though attitudes and motivation in the models of Bangert-Drowns et al. (1991) and Zimmerman (2013) are recognized to be important before and after feedback provision, they are, except for self-efficacy, only loosely mentioned, e.g., “some expectation”, “feels some degree of certainty” (Bangert-Drowns et al., 1991, p. 217) or “affective factors” (Clark & Zimmerman, 2014, p. 487). Hence, a clear framework is needed to further differentiate these attitudinal dispositions for model building.

Gist and Mitchell’s model of the self-efficacy-performance relationship

Gist and Mitchell propose a feedback reception model that was particularized to the self-efficacy change process but could be applied to other dispositions as well (1992, p. 189). The added value of the model is that its focal concern is the further processing of different types of feedback according to Bandura’s social cognitive theory (i.e., enactive mastery, vicarious experience, verbal persuasion, physiological arousal). These experiences provide information cues, which are then factored into the attributional analysis of upcoming task requirements under consideration of personal, situational, and resource constraints (Gist & Mitchell, 1992, p. 198). After the internal calibration process, the individual decides on the goal level and on whether to engage in the task. The resulting performance is followed by another feedback loop, whereby the reception process comes full circle.³ The model thus more prominently accounts for the lens model of contemporary learning theories, which assume that learners are not blank slates, but that their perceptions likely mediate the relationship between task cues and engagement (Butler & Winne, 1995, p. 253).

Eccles’ and Wigfield’s expectancy-value (EV) model of achievement motivation

The theoretical framework for the modeling of motivational appraisals lies in Wigfield and Eccles’ EV model (2002, p. 91), which originally stemmed from understanding motivational processes in math achievement (Ramirez et al., 2012).
It assumes that learners’ cognitive deliberations about expectancies for success and the subjective (intrinsic and extrinsic) value attached to domain-specific subject matter mainly contribute to the motivation to learn and engage in achievement tasks (Duncan et al., 2021, p. 2). Figure 3.3 first scaffolds the relevant facets⁴ according to the EV model before they are reframed into the constructs relevant for the context of this study.

³ A pictorial representation of the model is omitted and will be included in the later theoretical model.

⁴ More distal dimensions from the EV model, such as the cultural milieu, family background, socializers’ beliefs, age, etc., are omitted from the further analyses because they were shown to be of less influence compared to the other factors (Cashin & Elmore, 2005, p. 510; Ruggeri et al., 2008, p. 68).

[Figure 3.3 (diagram not reproduced). Recoverable elements: cultural milieu (gender and culture role stereotypes of subject matter; family demographics); previous achievement-related experiences; affective reactions and memories; goals and self-schemata (self-schemata of personal identity; short- and long-term goals; self-concept and perceptions of task demands); expectations of success; subjective task value (attainment value; intrinsic value; utility value; relative cost); achievement; label “across time”.]

Figure 3.3 Theoretical Expectancy-Value Model. (Note. Source: Author’s own selection based on Wigfield and Eccles’ complete framework (2002))

The EV model illustrates long-term ontogeny in that the categories on the left-hand side focus on macro-level, longer-term maturation processes that lead to more stable characteristics and self-beliefs related to learners’ history of experiences (Eccles & Wigfield, 2020). The microanalytic integral parts of the motivational model on the right-hand side, expectancies of success and subjective task value, refer to the most proximal determinants of achievement and achievement-related choices operating over shorter timeframes (Eccles & Wigfield, 2002, p. 118) and represent the focus of the current study. Expectancies of success are defined as learners’ convictions about how well they will perform in upcoming tasks, which are closely related to Bandura’s personal efficacy expectations (i.e., no incorporation of outcome expectations; Duncan et al., 2021, p. 4; Eccles & Wigfield, 2002, p. 119; Marsh et al., 2019, p. 342). Subjective task value refers to how satisfactorily a task meets intrinsic and extrinsic needs, resulting in the relative attractiveness of continuing to work on it (Wigfield & Eccles, 2002, p. 94). Subjective task value was postulated to consist of four components. (1) Attainment value refers to the perceived personal importance of performing well on an achievement-related activity when it accords with the learner’s role identity and self-schema. (2) Intrinsic value corresponds to the interest and enjoyment drawn from processing a task, while (3) utility value is the perceived usefulness for the life-defining future of the learner. Finally, (4) relative cost may refer to the expected effort and liabilities attached to task completion, or to opportunity costs (Eccles & Wigfield, 2020; Flake et al., 2015).

The authors found that expectancies of success influence value appraisals more strongly than vice versa, so that individuals deem tasks more valuable when they find themselves capable of mastering them (Eccles & Wigfield, 2002, p. 120). Specifically, expectancies of success seem to relate more strongly to attainment and intrinsic value than to utility value (Wigfield & Eccles, 2002, p. 105). Relative cost has been shown to relate negatively to expectancy and the other value components (Kosovich et al., 2014, p. 793). The EV model implies that subjective task value and expectancies of success are integral parts that operate in conjunction with each other. For instance, self-efficacious learners might still not pursue achievement-related activities if they do not deem them valuable, and vice versa, which might allude to the existence of multiplicative effects (see section 8.4.4). Both expectancies and values are influenced by longer-term and more distal socialization experiences represented in the middle of Figure 3.3. This category includes achievement goals and self-schemata, i.e., beliefs about oneself in achievement situations and about task difficulty related to the experiential environment of home, school, and other socializers, which affect the learners’ appraisals and eventually their motivation (Lazarides & Schiefele, 2021, p. 8; Eccles & Wigfield, 2020). Self-schemata and goals are in turn influenced by past performance experiences, domain-specific gender roles, and cultural stereotypes (Lazarides & Schiefele, 2021, p. 8). Most importantly, expectancies for success and subjective task value were shown to positively predict persistence and performance even when prior achievement-related experiences are controlled for (Marsh et al., 2019, p. 343). Despite its encompassing conceptualization of competency acquisition conforming to Weinert’s multidimensional interpretation (see section 3.2.1), the EV model has four major drawbacks that are relevant for the empirical adaptation.
First, there is a lack of empirical distinctiveness for some of the postulated EV model dimensions. The indistinct nomenclature partly stems from the proliferation of motivation theories in educational psychology and their changing historical alignment (Duncan et al., 2021, p. 2; Kosovich et al., 2014, p. 792; Marsh et al., 2019, p. 339). Even in their most recent paper, Eccles and Wigfield declared this lack of conceptual clarity to be unresolved and saw an urgent need for theoretically reconceptualizing the core constructs and developing new and more distinct measures (2020). For instance, attainment value and utility value find common ground in that they refer to the personal importance and general worth of task completion for self-identity formation, while utility value more specifically deals with the relevance for the learners’ long-range goals (Eccles, 2005, p. 112). Self-concept of one’s abilities and expectancies of success were strongly related to each other in confirmatory factor analyses and thus not clearly empirically distinguishable (Marsh et al., 2019, p. 343; Wigfield & Eccles, 2002, p. 96). Due to this overlap, attainment and utility value were summarized as “value”, and self-concept and expectancies for success as “self-efficacy”⁵ (Eccles & Wigfield, 2020; Ramirez et al., 2012, p. 834). Moreover, the corresponding category of “self-schemata and goals” refers to longer-term and more stable characteristics, whose interrelations with the neighboring constructs had been measured over several years (i.e., whole passages of childhood and adolescence). As the focus of the present study lies on the microanalytic interrelations between formative achievement and determinants of proximal motivational development, these more distal aspects will be omitted (similar to the adapted EV model from Lazarides & Schiefele, 2021, p. 8). By contrast, the EV-achievement relations focused on in this study refer to more immediate, moment-to-moment cognitive processes that can be represented over the course of a semester (Eccles & Wigfield, 2020).

The second drawback is that, even though the appropriateness of the EV model for longitudinal assessments has been documented in various developmental studies, most of them focus on trajectories in childhood or adolescence (i.e., ages 6–18; Eccles & Wigfield, 2020; Ramirez et al., 2012, p. 62) and neglect the potentially steadier characteristics of adult college students.

Third, despite the consideration of personal factors, such as gender, prior knowledge, and ethnicity, the dynamic influence of other contextual factors, such as teaching practices or other instructional processes, on the trajectories of expectancies and values has not yet been sufficiently considered or researched in the EV context (Lazarides & Schiefele, 2021, p. 4). The EV model will therefore be tested in traditional and flipped classroom configurations of large lectures to shed light on its applicability in a variety of educational contexts.
Fourth, instructional learning factors that might influence the belief-behavior relationship are not explicitly factored into the EV model (Wigfield & Eccles, 2002, p. 109). The present study follows up on this by incorporating feedback on achievement outcomes. Such information influences learners’ retrospective appraisals of their performance (Pekrun, 2006, p. 336). It thus helps to determine whether their self-efficacy beliefs, for instance, were successfully translated into actual learning achievements and whether the assumed goals have been reached (see chapter 3). The second to fourth drawbacks call for a more microanalytic approach, based on Eccles and Wigfield’s recommendation to relate EV-related investigations to concrete domain-specific contexts (2002). This also conforms to findings suggesting that EV appraisals contain trait as well as state aspects, which operate more selectively in relation to specific task situations (Bandura, 1997; Finney & Schraw, 2003, p. 162). These findings imply that students’ situational appraisals may differ from their general attitude, which necessitates domain-specific assessments to adequately account for more dynamic, subject-specific changes within narrower temporal contexts (Bandura, 1997; Finney & Schraw, 2003). Along these lines, a preliminary conceptualization of the theoretical feedback model will be postulated with a short-term, domain-specific, and instructional alignment.

⁵ Even though researchers often struggle to differentiate between self-concept and self-efficacy, Marsh et al. (2019) as well as Bong and Skaalvik (2003) provide meaningful distinctions based on a synthesis of item operationalizations. Self-concept evaluations, on the one hand, are mostly based on past accomplishments, whereas self-efficacy, on the other hand, is a goal-oriented, prospective appraisal of a mostly short-term expected accomplishment. Moreover, self-efficacy appraisals tend to set absolute descriptions of the goals to be achieved, while self-concept involves evaluations of skills based on social comparison. The items used in the later assessment of the construct clearly relate more to the criteria of self-efficacy, so that this term will be used henceforth.

3.2.3 The Feedback-Related Achievement Motivation Processing Model

To address these drawbacks, the previously mentioned models will be intertwined based on the strong theoretical grounding of the EV model, which has already been concretized for the domain of statistics acquisition by means of the Survey of Attitudes Toward Statistics model (SATS-M; Ramirez et al., 2012). Empirical testing of the SATS-M approach with large samples, however, still remains to be done and is the subject of the analyses that follow. Since the theoretical EV model and the reframed SATS-M use slightly different conceptualizations and constellations of constructs, Figure 3.4 juxtaposes the reframed model with the EV model under consideration of the above-mentioned necessary adaptations. Moreover, to emphasize the cyclical nature of the learning trajectories, some of the terminology of Bangert-Drowns et al.’s five-stage model (1991), Gist and Mitchell’s model of self-efficacy (1992), and Zimmerman’s triadic model of self-regulation (2000) is factored into the analytical model. Apart from the modifications mentioned above (i.e., merging of attainment and utility value, removing “goals and self-schemata”), Ramirez et al. added the new dimension of “affect” to the SATS-M (2012, p. 834). They thereby extracted the enjoyment component from the original EV model conceptualization of intrinsic value (modified denotation: interest) and also included statistics-related anxiety as another central component within statistics knowledge acquisition (Onwuegbuzie, 2003). Longitudinally, a feedback loop from the final summative exam to prior achievement-related experiences as indicated in the EV model (see Figure 3.4) is

[Figure 3.4 (diagram not reproduced). Recoverable elements: forethought phase, performance phase, and self-reflection phase; prior formative achievement ($b_{t-1}$); feedback; expectancy appraisals (expectations of success: self-efficacy; perceptions of task demands: difficulty); value appraisals (attainment and utility value: value; intrinsic value: interest; statistics anxiety and enjoyment: affect; relative cost: effort); strategy use; formative achievement ($y_t$); summative achievement; feedback loops across time.]

Figure 3.4 Feedback-Related Achievement Motivation (FRAM) Processing Model. (Note. Source: Author’s own)

not considered because there is only data for a one-semester course that ended with the final exam. More importantly, formative achievement in each of the four quizzes was simultaneously construed as short-term prior achievement-related experience. Hence, current performance becomes immediate past experience for the next iterative feedback loop (Eccles & Wigfield, 2020, p. 3). Following the EV rationale, EV appraisals at the beginning of the course influence subsequent formative achievement (i.e., quiz performance). The resulting corrective feedback at the current measurement occasion ($y_t$) thus becomes further processed, short-term prior mastery experience at the next measurement occasion ($b_{t-1}$),⁶ which then again influences students’ expectancy and value appraisals in the subsequent feedback loop (Eccles & Wigfield, 2020). Learners will self-regulatorily decide to what extent to appraise the feedback, e.g., appreciate its value, and whether they perceive themselves to have control over the outcome (Lipnevich & Panadero, 2022, p. 17). These appraisals, in turn, influence behavior, i.e., formative achievement, again, and so forth. These feedback loops account for the assumption of modern attitudinal frameworks that cognitive-motivational trajectories are subject to a reciprocal effects mechanism, combining the formerly popular skill-enhancement and skill-development models (Burns et al., 2020, p. 78). This framework also considers performance as a potentially exogenous factor that can actively influence subsequent statistics attitudes in terms of a feedback effect. Before emotional, personal, and contextual determinants (i.e., gender, prior knowledge, course design) are additionally considered in the analytical model, the theoretical and empirical state of research on the separate relationships between EV and (formative as well as summative) academic achievement will be particularized for hypothesis generation in the following subchapter.

⁶ Different denotations for formative achievement at the current measurement occasion ($y_t$) and prior formative achievement ($b_{t-1}$) were chosen to underline their different roles as endogenous variables and predictors, respectively.
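As an illustration of the feedback loops described in this section, the following minimal Python sketch iterates the cycle of appraisal, quiz performance, feedback, and updated appraisal over four quizzes. It is only a sketch under stated assumptions: the names `LoopParams` and `simulate_feedback_loops`, the linear update rule, and all coefficient values are hypothetical illustrations and are not taken from the study or its later statistical models. All quantities are treated as standardized scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopParams:
    """Hypothetical standardized path coefficients (illustration only)."""
    appraisal_to_achievement: float = 0.4  # prior appraisal -> current quiz score y_t
    achievement_to_appraisal: float = 0.3  # fed-back quiz score b_(t-1) -> next appraisal
    appraisal_stability: float = 0.5       # carry-over of the appraisal itself
    achievement_stability: float = 0.5     # carry-over of achievement itself


def simulate_feedback_loops(initial_appraisal, prior_achievement,
                            n_quizzes=4, params=LoopParams()):
    """Return one (quiz_score, updated_appraisal) pair per feedback loop."""
    appraisal, achievement = initial_appraisal, prior_achievement
    trajectory = []
    for _ in range(n_quizzes):
        # Forethought and performance phase: the current appraisal
        # (together with earlier achievement) shapes the quiz score y_t.
        achievement = (params.achievement_stability * achievement
                       + params.appraisal_to_achievement * appraisal)
        # Self-reflection phase: feedback on y_t serves as prior experience
        # b_(t-1) and updates the appraisal for the next loop.
        appraisal = (params.appraisal_stability * appraisal
                     + params.achievement_to_appraisal * achievement)
        trajectory.append((achievement, appraisal))
    return trajectory


if __name__ == "__main__":
    for quiz, (score, appraisal) in enumerate(simulate_feedback_loops(1.0, 0.0), start=1):
        print(f"quiz {quiz}: score={score:.3f}, appraisal={appraisal:.3f}")
```

Under these assumed positive paths, a learner who starts with a higher appraisal retains both higher quiz scores and higher appraisals in every loop, which mirrors the reciprocal reading of the model. Setting both cross-paths to zero decouples the two trajectories.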

3.2.4 Approaching the Theoretical and Empirical Research on the Interrelations between Construct-Specific Expectancy-Value Appraisals and Feedback

Based on the above-elaborated analytical framework, the specific relationships between the EV appraisals as well as formative and summative academic achievement will be considered from a theoretical perspective. Each of the following subchapters will be structured according to the chronological feedback reception process. Figure 3.5 visualizes the planned schematic procedure of each of the following construct-specific subchapters.

[Figure 3.5 (diagram not reproduced). Recoverable elements: construct definition; 1) theoretical evidence; 2) empirical evidence.]

Figure 3.5 Structure of the Theoretical Analysis for Sections 3.2.5–3.2.9 and 3.3.3

Before approaching the performance-appraisal interrelations, each subchapter introduces the respective construct along with its theoretical conceptualization and relevant terminology. Then, in a first step, following the intake of information in the forethought and performance phase, theoretical considerations on the relationship between prior EV appraisals and subsequent (formative and summative) achievement will be outlined (1a). Subsequently, following the postprocessing of feedback in the self-reflection phase, the relationship between prior achievement and subsequent EV appraisals will be derived theoretically (1b). Afterwards, this theoretically elaborated feedback loop will be substantiated by empirical findings. In a second step, therefore, empirical findings on the effect of EV appraisals on academic achievement will be considered from the domain of statistics education and, occasionally, other relevant domains (2a). The focus is on statistics-related findings because “omnibus traits” are recommended to be contextualized within the respective subject area to adhere to the EV-related assumption of domain specificity (Bandura, 1986, p. 370; Usher & Pajares, 2008, p. 785; Wigfield & Eccles, 2002, p. 94). The next logical step of reviewing empirical research on the effect of prior achievement on subsequent EV appraisals (2b) reveals a limitation of extant findings on attitudes towards statistics. Studies within the domain of statistics education focus almost exclusively on the relationship of attitudes with end-of-term, summative achievement, so that attitudes are in most cases construed as determinants of ability according to traditional self-enhancement model assumptions. By contrast, few studies have used achievement motivation and achievement emotions as dependent variables with (formative) achievement as the antecedent, thus neglecting potential feedback effects (Lipnevich et al., 2021). This is why empirical studies outside the domain of statistics education will be consulted concerning the effect of formative performance on subsequent appraisals in order to bring the feedback process full circle.

3.2.5 Reciprocal Relations between Statistics Self-Efficacy and Achievement

1a) Theoretical evidence on the appraisal → achievement relation

Statistics-related self-efficacy is understood as the self-rated confidence in having an internal locus of control, resulting in a believed capability to accomplish statistical tasks (Bandura, 1997; Eccles & Wigfield, 2002, p. 110; Schau et al., 1995). Theoretical support for the assumption of positive relationships between feedback and self-efficacy is rooted in Bandura’s social cognitive theory, according to which self-efficacy is a product of self-regulative processes involving self-monitoring and self-assessment. These self-referential thought processes are intended to direct individual learning efforts towards desired reference frames and are therefore assumed to positively affect achievement behavior (Bandura, 2015; Salzmann, 2015). According to the situational strength hypothesis, rationally behaving students with higher self-efficacy regarding their current level of knowledge are expected to adjust their learning behavior towards the higher learning benefit (Zimmerman & Moylan, 2009, p. 307). Moreover, attribution theory postulates that students who feel more competent are better able to neutralize external influences, such as negative feedback, and to invest more time in studying, while less self-efficacious individuals more easily refrain from attempting because they perceive less personal control over external factors and tend to attribute success to luck instead of true ability (Beile & Boote, 2004, p. 58; Hurley, 2006, p. 441; Yeo & Neal, 2006, p. 1090). Students with higher self-efficacy might also profit from a halo effect such that they have a general tendency to feel efficacious in performing a wider variety of free-choice tasks, while less self-efficacious students tend to avoid them (Yeo & Neal, 2006, p. 1090; Pajares, 2002, p. 116). These self-reinforcing mechanisms are strengthened by the fact that high prior self-efficacy is likely to be accompanied by stronger self-monitoring processes, ambitious goals, and an optimistic, persistent approach to challenging tasks instead of shying away from them (Adams et al., 2019, p. 319; Compeau & Higgins, 1995, p. 192; Hammad et al., 2020, p. 1504; Pajares, 2002, p. 117; Wilde & Hsu, 2019, p. 3; Yeo & Neal, 2006, p. 1098). By contrast, students with low expectations of success are less likely to muster the motivation for engaging with the feedback information (Evans, 2013, p. 96). It could thus be assumed that students with higher self-efficacy benefit more from feedback opportunities than less self-efficacious individuals, who might not feel sufficiently capable of interpreting and applying the given feedback and thus either fail to put it into practice effectively in terms of self-improvement or avoid these opportunities completely (Banfield & Wilkerson, 2014, p. 292).
1b) Theoretical evidence on the achievement → appraisal relation Apart from the assumed impact of initial self-efficacy on performance, social cognitive theory postulates that the above-mentioned reference frames sharpen one’s sense of self-efficacy in the context of enactive mastery, i.e., repeated past success (Arens et al., 2022, p. 619; Beaston et al., 2018; d’Alessio, 2018, p. 456; Gist, 1987, p. 473; Hurley, 2006, p. 441). Feedback thereby increases the likelihood for students to reflect on their mistakes and correct them (Hammad et al., 2020, p. 1520). Following the self-enhancement theory and entity views, positive evaluations are assumed to be received more positively and thus increases selfefficacy with the endeavor of maintaining a positive self-image (van de Ridder et al., 2014, p. 805). Negative feedback might however result in an attribution of failure to influences outside the individual sphere and enact unbeneficial, selfprotective affective mechanisms. The feedback intervention theory postulates that

56

3

A Model for Reciprocal Interrelations …

self-efficacy is lowered when tasks, particularly those including personally relevant goals, are unresolved (Gist & Mitchell, 1992, p. 199; Kluger & DeNisi, 1996, p. 261). Thus, self-efficacy interventions allowing students to purposefully gear their learning behavior and efforts towards the achievement of the desired outcomes by means of prior experience are considered to be most promising for statistics procedural reasoning (Beile & Boote, 2004, p. 58; Tolli & Schmidt, 2008). The social-cognitive assumption of reciprocity between enactive experiences and self-efficacy conforms to the postulated feedback loops within the analytical model. The quiz scores distributed throughout the semester also offer such learning occasions, so that processing of the quizzes and their reception is assumed to function as mastery experience allowing to better assess in how far current performance conforms to the required knowledge within the course (Evans et al., 2021, p. 164). In other words, success is assumed to enhance self-efficacy whereas failure has an undermining effect. 2a) Empirical evidence on the appraisal → achievement relation Empirical findings in various domains have shown that perceived self-efficacy is in most cases the strongest predictor for persistence and academic achievement (Chiesi & Primi, 2010; Finney & Schraw, 2003, p. 162; González et al., 2016; Hammad et al., 2020; Macher et al., 2013; Stanisavljevic et al., 2014; Tempelaar, van der Loeff, et al. 2007, p. 96). Regarding quiz scores, Beile and Boote (2004, p. 60), d’Alessio (2018), and Hood et al. (2021) found that students with higher self-efficacy performed better in formative and summative assessments than those with lower self-efficacy. Wilde & Hsu attested the same mechanism, but for vicarious instead of enactive experiences (2019, p. 17). Tolli and Schmidt (2008, p. 
699) also found that failure attributions following negative feedback tended to lead to a downward goal revision, in accordance with self-enhancement theory. These findings suggest an entity view in which individuals construe ability as a rather stable trait that limits performance, so that more efficacious students work more arduously while more doubtful students tend to work lackadaisically (Schunk, 1991).

2b) Empirical evidence on the achievement → appraisal relation Regarding the impact of feedback on subsequent self-efficacy, empirical findings suggest that enactive mastery and feedback are the most influential determinants of self-efficacy (meta-analytically: Ashford et al., 2010, with an effect size of d = .43, and Talsma et al., 2018, with d = .21; in single studies: de Vries et al., 1988, p. 274; Evans et al., 2021, p. 164; Yeo & Neal, 2006, p. 1090). Accordingly, instructional interventions (such as a “successful mathematics classroom” in Hammad et al., 2020) including formative self-constructed assessments were shown to improve performance and subsequent self-efficacy beliefs (Beaston et al., 2018; Beile & Boote, 2004; d’Alessio, 2018; Hammad et al., 2020; van de Ridder et al., 2014). A few studies also took a closer look at the reciprocal efficacy-performance relationships in cross-lagged panel models (Arens et al., 2022; Burns et al., 2020). Both studies found that math self-efficacy was strongly reciprocally related with self-constructed formative assessments (math test scores in Arens et al., 2022, and psychology quizzes in Burns et al., 2020), whereas the relationship between self-concept and achievement was only unidirectional. Arens et al. (2022, p. 628) also emphasized that summative grades are rather found to be reciprocally related with academic self-concept due to their salience and their strong signaling effect regarding domain competence and further educational opportunities, which are then incorporated into the individuals’ more general self-schemata according to the EV model (see the distinction between self-concept and self-efficacy in footnote 5). By contrast, the concrete handling of task-specific test items might have contributed more to the mastery experiences with respect to the concrete course context (Arens et al., 2022, p. 628). Accordingly, it could be assumed that self-efficacy is (reciprocally) related with formative test scores, but not with end-of-term summative achievement. The reciprocal relationship was also supported in a meta-analysis of longitudinal studies, with small effect sizes for general and small to medium effect sizes for domain-specific measures of self-beliefs in relation to achievement variables including grades or self-constructed standardized tests (Valentine et al., 2004). Based on the above-elaborated theoretical and empirical state of research, the first hypothesis reads as follows:

H1: Formative achievement and statistics-related self-efficacy reciprocally positively predict each other throughout the semester.

3.2.6 Reciprocal Relations between Statistics Difficulty and Achievement

Compared to self-efficacy, there is far less theoretical and empirical evidence on the relations between task difficulty, in the narrower sense, and formative as well as summative achievement, which is accompanied by a fuzziness of the concept itself. First of all, a distinction has to be made between task complexity, determined by test constructors by means of, e.g., item analyses, and the subjectively perceived difficulty experienced by a task-doer (Campbell, 1988, p. 48; Zeidner, 1998, p. 172). There are quite a few moderating influences, such as context or self-doubts, that may lead to a discrepancy between objective and subjective task difficulty (Campbell, 1988, p. 48; Maad, 2012, p. 2), so that, after all, difficulty remains a reaction to subject characteristics. To keep the theoretical framework in alignment with the EV model, which focuses on individuals’ perceptions, statistics-related difficulty in this study is construed as a matter of individually different learner perceptions of the difficulty of statistical content, thus refraining from objectivist, reductionist data (such as solution frequency) from performance assessments (Maad, 2012, p. 2).

1a) Theoretical evidence on the appraisal → achievement relation Bandura (2015, p. 44) subsumes subjective perceptions of task difficulty under the magnitude of one’s self-efficacy as it refers to the level of complexity of subject matter that is believed to be attainable. More concretely, individuals with a high (or low) difficulty appraisal would perceive themselves able to accomplish simple (or also complex) tasks (Bandura, 1986, p. 370; Compeau & Higgins, 1995, p. 192; de Vries et al., 1988, p. 273). Gist and Mitchell also agree that self-efficacy attributions are inter alia determined by past experiences about task difficulty (1992, p. 193). Pekrun (2006, p. 20) subsumes the perceived task demands under students’ perceptions of competence and control over academic activities, which influence expectations for success.
Difficulty appraisals are assumed to be a more stable attribute with a lower variability over time, which are only controllable insofar as individuals cognitively appraise them in (more or less) beneficial ways (Autin & Croizet, 2012, p. 616; Gist & Mitchell, 1992, p. 193). Based on the retrieval effort hypothesis, it can be assumed that performance follows an inverted u-shape form with increasing task complexity (Brehm & Self, 1989, p. 119; Campbell, 1988, p. 49; Timmers & Veldkamp, 2011, p. 924). More concretely, it theorizes that motivational arousal increases with difficulty up to the point that exceeds desirable difficulties, where attentional demand is not manageable anymore (Brehm & Self, 1989, p. 109; Enders et al., 2021, p. 93). Hence, overly high difficulty appraisals are expected to overtax information processing capacities. This could lead to interfering off-task thoughts and uncertainties hijacking working memory capacities and eventually hamper performance if they are not controlled efficiently (Bastian & Eschen, 2016, p. 182; Campbell, 1988, p. 49; Elliott & Dweck, 1988, p. 6). Autin and Croizet assume that this is due to

the cultural assumption that appraisals of difficulty are a sign of inability (2012, p. 614). Extremely difficult tasks are thus assumed to lead to failure expectancies whereas moderate levels of difficulty are most suitable to maximize motivational intensity (Brehm & Self, 1989, p. 118; Timmers & Veldkamp, 2011, p. 924; Zeidner, 1998, p. 173). Intermediate difficulty also avoids the experience of boredom if the task is too easy, or hopelessness if it is too complex (Deci et al., 1996, p. 177; Pekrun, 2006, p. 20). 1b) Theoretical evidence on the achievement → appraisal Another issue in the determination of an optimum level of difficulty lies in the differential response patterns to subjective complexity and failure contingent on students’ underlying goal affiliation according to the achievement goal theory (Elliott & Dweck, 1988, p. 10; Shute, 2008, p. 166). On the one hand, performance goal orientation is dominant if individuals focus on the sheer measurement of ability and its adequacy (Butler & Winne, 1995, p. 263). Under constant apprehension of failure, individuals however begin to feel helpless, anxious, disinterested and resort to maladaptive avoidance strategies leading to lower performance (Maad, 2012, p. 3; Muis et al., 2013, p. 556). On the other hand, when formative achievement is sufficiently high, students are likely to react in a mastery-oriented manner in the face of future obstacles and deem the difficulty level manageable (Elliott & Dweck, 1988, p. 10; Muis et al., 2013, p. 556). The mastery goal orientation is characterized by solution-oriented processing of difficult tasks whereby mistakes are construed as learning opportunities for further skill development. The negligence of failure attributions evokes positive affect and persistence in the face of challenges, eventually leading to improved and sustained performance (Shute, 2008, p. 166). 
Feedback provision thereby is an effective means to translate mistakes into learning opportunities and to shift the focus from performance to mastery goal orientation (Shute, 2008, p. 167). Information on the current performance would contribute to reduce concerns about intellectual incapability by reframing difficulty as a natural byproduct of learning (assessment for learning) that should be actively addressed instead of recurring on avoidance strategies (Autin & Croizet, 2012, p. 615). In other words, performance information aims at decreasing the cognitive interferences of difficulty as a threat to the self-image by suggesting that “mistakes” can be prevented and amended if they are laid bare (Autin & Croizet, 2012, p. 611). The processing of performance feedback also depends on the level of appraised task difficulty in such a way that successful completion of demanding tasks might be a greater facilitator of subsequent appraisals

while easy tasks might have no meaningful impact on motivational trajectories (Bandura, 1986, p. 363). 2a) Empirical findings on the appraisal → achievement relation The complicacy to determine favorable levels of perceived difficulty for performance enhancement is also reflected in the inconsistency of empirical findings. Tempelaar, van der Loeff, et al. (2007, p. 96), on the one hand, found that lower perceived difficulty resulted in lower statistics performance, implying that students may have either underestimated their coping abilities or that the insufficient complexity of the tasks did not sufficiently stimulate goal attainment. On the other hand, Chiesi & Primi (2010) found a significantly positive relation indicating that lower perceived difficulty related to higher achievement. Other studies in turn did not find any significant relationship (e.g., Stanisavljevic et al., 2014) and in a review of 17 studies, Ramirez et al. found that difficulty was “rarely related to achievement” (2012, p. 65). Following this conundrum, Schau and Emmioglu (2012, p. 93) went on to argue that neither simple nor overly challenging tasks contribute to beneficial motivational and cognitive development, therefore assuming non-linear relationships between task difficulty and performance. Indeed, Brehm and Self (1989) as well as Locke and Latham (2002) found that achievement behavior is highest for intermediate levels of difficulty. In their meta-analysis, they found a positive relationship between difficult goals and performance with effect sizes ranging from d = .52 to d = .82. The performance increase was however delimited by (lack of) ability and persistence (Locke & Latham, 2002, p. 706), alluding to the inverted u-shape of the difficulty-performance relationship. 
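The non-linear difficulty-performance relationship discussed above could be formalized, as a minimal sketch, by adding a quadratic term. The symbols used here (P for performance, D for perceived difficulty, and the coefficients b) are illustrative assumptions; they do not stem from the cited studies or from the model specification of the present study.

```latex
\begin{align}
P_i &= b_0 + b_1 D_i + b_2 D_i^{2} + \varepsilon_i, \qquad b_1 > 0,\; b_2 < 0, \\
D^{*} &= -\frac{b_1}{2\, b_2}
\end{align}
```

Under these assumptions, expected performance rises with perceived difficulty up to the turning point D* and declines beyond it, which corresponds to the inverted u-shape. A purely linear specification (b_2 = 0) cannot represent such a turning point.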
2b) Empirical findings on the achievement → appraisal relation Findings from Timmers and Veldkamp (2011) suggest that, paradoxically, students with higher test scores in formative assessments were more receptive to subsequent elaborate feedback on the remaining incorrect answers. By contrast, students with a lower test score (i.e., for whom the task was more difficult) tended to disregard feedback information. Hence, those students who would be in higher need of corrective feedback lose the motivation to follow up on their misconceptions and would thus sustain their high difficulty appraisal. These preliminary findings are suggestive in two ways: First, as elaborated in the theoretical part, the amount of feedback information might overtax the working memory of less proficient students and result in a complete resistance to learning. Second, students with a higher score might indeed shift from the performance to the mastery goal approach in such a way that they continue to work on the remaining knowledge gaps (Maad, 2012, p. 4). Both implications entail that performance-related motivational effects get bogged down in positive and negative extremes (i.e., positive difficulty appraisal further improves whereas negative appraisal further decreases performance), which conforms to the theoretical and empirical findings on the self-efficacy-performance relationship. On grounds of the ambiguous state of research, the present study goes with the intuitive hypothesis that higher difficulty is related to subsequent failure and that higher performance leads to a beneficial lowering of the difficulty appraisals according to the achievement goal theory, while lower performance might result in disregarding motivation-enhancing feedback information:

H2: Formative achievement and statistics-related difficulty appraisals reciprocally negatively predict each other throughout the semester.

3.2.7 Reciprocal Relations between Statistics Interest/Utility Value and Achievement

Loss of interest and value has been found to be a major reason to opt out of introductory STEM courses (Rosenzweig et al., 2020, p. 167). Referring to the EV model, the following paragraph discusses relations between three components of subjective task value: utility/attainment value (hereinafter: value) and intrinsic value (hereinafter: interest).⁷ SDT allows for a further differentiation of these two appraisals along the degrees of internalization (Deci, Ryan, & Williams, 1996, p. 167). Interest refers to the perception of tasks as inherently enjoyable due to individual inclinations (Cai et al., 2018, p. 436; Cole et al., 2008, p. 613; Harackiewicz, Canning, et al., 2016, p. 746). Interest can be further differentiated into situational interest, which refers to the state of momentary arousal, enjoyment and excitement when performing a specific task, and individual interest, which implies a longer-term, trait-like inclination to specific domains (Eccles & Wigfield, 2002, p. 114; Harackiewicz, Smith, et al., 2016b, p. 220; Rosenzweig & Wigfield, 2016, p. 152).⁸ According to SDT, interest refers to an entirely intrinsically motivated behavior because it is not linked to a separable consequence but has the autotelic purpose of experiencing feelings of enjoyment and satisfaction (Burton et al., 2006, p. 750; Taylor et al., 2014, p. 342; Weidinger et al., 2016, p. 118). The opposite pole, extrinsic motivation, refers to behavior that strives for the attainment of concrete consequences (e.g., rewards, acceptance). Internalization refers to the transformation of originally extrinsic motivation in such a way that the learner performs the instrumental behavior wholly volitionally in accordance with their sense of self. It ranges from entirely external (i.e., initiated by rewards), over introjected (i.e., internalized for reasons of self-worth without fully accepting the regulation), to identified/integrated regulation (initiated from a sense of personal meaning; Deci & Ryan, 2016, p. 11; Taylor et al., 2014, p. 343; Hulleman & Barron, 2016). Value in the context of this study refers to the perceived usefulness and importance of a task to fulfil some short- or long-term goal (i.e., study success and career aspirations) or for personal identity formation. The value appraisals targeted in this study rather assess how far learners deem the activity personally useful, which refers either to identified or integrated regulation, depending on the extent to which they have integrated the values into their own identity formation (e.g., due to coherence with their own career aspirations; Deci et al., 1996, p. 169). In short, interest refers to unforced engagement in personally interesting activities, whereas integrated regulation still involves voluntary engagement but with an instrumental purpose that is deemed meaningful and important. Both utility value and intrinsic value can therefore still be assumed to fall under the category of intrinsic motivation.

⁷ The cost and enjoyment value components will be considered separately in the subsequent paragraphs.
⁸ In the context of this study, the interest construct used for later measurement rather refers to the individual interest while the enjoyment construct is the situational interest related to concrete activities.

1a) & b) Theoretical findings on the appraisal ↔ achievement relation For the assumed relationship of utility and intrinsic value with subsequent achievement, the EV theory assumes that students direct their motivation and devote their time to tasks and goals that they find valuable and interesting (Acee & Weinstein, 2010, p. 488; Gaspard et al., 2015, p. 1227; Wigfield & Eccles, 2002, p. 109). Changes in students’ interest and value appraisals are commonly ascribed to models of persuasion and conceptual change which stem from information processing theory. More concretely, depending on students’ coping mechanisms, the more favorably feedback information is processed, the higher its impact on the targeted attitude, and vice versa. Assuming that the processing of feedback conveys a useful instrumentality of the task for one’s own progress, students might positively appraise the value of the tasks to follow (Acee & Weinstein, 2010, p. 490). Regarding the impact of feedback on subsequent interest and utility value, cognitive evaluation theory can be consulted as an integral part of SDT. It postulates that the impact of external events with informational significance (e.g., rewards, grades, feedback) on intrinsic motivation is mediated by the underlying locus of causality (Deci et al., 2001, p. 3; Clark & Svinicki, 2015, p. 52; Reeve, 2012, p. 153). Hence, feedback associated with pressuring or controlling aspects and involving low perceived competence in the sense of SDT (external locus) is expected to decrease intrinsic motivation. By contrast, positive, informational competence feedback, involving higher perceived self-determination (internal locus), increases intrinsic motivation because it communicates competent functioning and refers to personal fulfilment as well as mastery orientation (Baker & Goodboy, 2019, p. 82; Reeve, 2012, p. 156; Weidinger et al., 2016, p. 118). Based on these theories, positive correlations between intrinsic motivation and achievement can be expected. Concerning the categorization of positive and negative feedback, in the context of the present study, performance information is given as the proportion of correctly solved exercises, so that the degree of appraised positivity might also depend on the magnitude of this share. That is, a score above .75 could be perceived as more positive because it consists of more competence-affirming than contradictory advice (Lipnevich et al., 2021).
Irrespective of such arbitrary cutoffs and their individually different receptivity, it could be assumed that feedback including more negative information contradicting the current knowledge level leads to decreases and predominantly positive information fuels intrinsic motivation (Deci et al., 2001, p. 5).⁹

⁹ The reception of the feedback as positive or negative also depends on whether students are entity or incremental theorists and on the goal approach they pursue (i.e., mastery or performance goal).

2a) Empirical evidence on the appraisal → achievement relation Compared to the relations between feedback and self-efficacy, research on intrinsic and utility value is scarcer (Acee & Weinstein, 2010, p. 489). Starting with the relationship of interest and value with achievement in the context of statistics education, findings are mixed: some studies suggest that interest (Stanisavljevic et al., 2014) or value (Carmona et al., 2005; Kiekkas et al., 2015) did not relate significantly to academic achievement, whereas other studies found significant relations for at least one of the components (for interest: Macher et al., 2012; Macher et al., 2013; for value: Chiesi & Primi, 2010; Stanisavljevic et al., 2014; for both: Paul & Cunnington, 2017). The significant (co-)relations for value and interest were lower than those of self-efficacy beliefs in any case. This coincides with experimental and longitudinal findings from other domains in which expectancies for success were found to be stronger predictors of achievement whereas subjective task value was rather found to relate to achievement-related choices, such as persistence, enrollment, or career decisions (Acee & Weinstein, 2010, p. 488; Cole et al., 2008, p. 613; Harackiewicz, Smith, et al., 2016, p. 223; Pintrich & De Groot, 1990; Wigfield & Cambria, 2010, p. 21). Nonetheless, research has shown that a higher degree of intrinsic motivation (with identified/integrated motivation at the most autonomous end) is associated with better adjustment, higher learning outcomes and higher well-being (Burton et al., 2006, p. 750; Deci et al., 1996, p. 170; Weidinger et al., 2016) while extrinsic pressures were found to undermine motivation and performance (Eccles & Wigfield, 2002, p. 112). These findings were confirmed in a meta-analysis including 18 studies and 4,000 participants that assessed the relationship of intrinsic motivation with achievement across varying school and university contexts with an effect size of d = .35 (Taylor et al., 2014, p. 345). Extrinsic motivation, by contrast, even had a negative effect of d = −.22. In the context of this study, interest assesses a higher degree of intrinsic motivation, so that it might be more strongly related with achievement compared to utility value, which is merely an internalized instrumental behavior. In short, utility value refers to a means-end orientation while interest value sees the activity as an end in itself (Wigfield & Cambria, 2010). In further single studies, Cole et al.
(2008) found that students attaching more importance and usefulness to domain-related tasks invest more effort and perform better on low stakes tests. In a literature review, Deci et al. (1996, p. 171) found a few studies with positive correlations between interest, retention, conceptual learning, and academic achievement in general. In a literature review on EV appraisals and their relation to performance, Wigfield and Cambria (2010) report several studies suggesting that intrinsic and utility value and performance predict each other reciprocally. Their review however focuses mainly on elementary and secondary school students. 2b) Empirical evidence on the achievement → appraisal relation Regarding the impact of feedback on achievement, Butler (1988) suggested the distinction between task- and ego-involving types of feedback to account for differential manifestations in the relationships. Task-involved evaluation (e.g., individual information or comments on tasks that have not yet been mastered)

aims at enhancing mastery development by means of relevant and challenging tasks while ego-involving evaluation utilizes normative grades (Butler, 1988, p. 2). In a randomized experiment, the author found that interest was only enhanced for task-involving evaluation; the ego-involving evaluation was only beneficial for high achievers who anticipated the provision of grades whereas it had detrimental effects on the interest of lower-achieving students (Butler, 1988, p. 11). In the present study, interest-enhancing effects could be assumed since the four quizzes are ungraded and therefore task-involved. In a meta-analysis on the effects of rewards and reinforcement on intrinsic motivation, Cameron and Pierce found that positive feedback and verbal praise enhance students’ intrinsic interest while negative feedback focusing on lack of skill is perceived as demotivating (1994, p. 397). The detrimental impact of negative feedback on intrinsic motivation was also found in a study based on path mediation models (Weidinger et al., 2016). The negative feedback conveyed information on individual performance relative to other same-aged individuals, namely how many of their peers performed better on a numerical test. A limitation of this study however is that the performance information was fictive, i.e., unrelated to the actual performance, which may have caused even stronger debilitating effects on subsequent intrinsic motivation if participants misjudged their performance. Compared to interest, (utility and attainment) value is considered to be more easily amenable to instruction because it is regarded as the most extrinsic component owing to the link between a task and its usefulness to personal goals (Gaspard et al., 2015, p. 1227; Harackiewicz, Smith, et al., 2016b, p. 223; Hulleman & Barron, 2016).
Therefore, apart from correlational studies suggesting performance-enhancing impacts of utility value, a prominent strand of research is devoted to the investigation of value-reappraisal interventions with different treatment conditions. In most cases, either (1) the instructor communicates the importance and potential value of the respective subjects or (2) the students write short essays on the reasons why they deem the subject subjectively important for their everyday life and professional life (e.g., Harackiewicz, Smith, et al., 2016; Hulleman & Barron, 2016; Hulleman et al., 2010). Value-reappraisal interventions of the first category for instance present different messages (e.g., quotations from domain specialists) on the professional relevance and the intrinsic enjoyment of the respective domain which students are asked to scrutinize and evaluate (e.g., Acee & Weinstein, 2010; Cai et al., 2018; Rosenzweig et al., 2020). The second category entails assignments asking students to write short texts on their self-appraised value of the respective subject (e.g., Gaspard et al., 2015; Harackiewicz, Smith, et al., 2016).

The large majority of both intervention types were found to positively impact students’ subjective task value, instrumentality, and academic achievement compared to control groups without the value treatment, in accordance with the EV model. The second type of intervention was found to be more effective than externally communicated relevance, which might be due to the fact that appreciation of the course material is higher when students seek meaning and make connections in their own terms while also promoting deep levels of cognitive processing (Harackiewicz, Smith, et al., 2016; Rosenzweig & Wigfield, 2016, p. 151). Zingoni & Byron (2017) found that the value attached to feedback depends on whether recipients are entity or incremental theorists, i.e., whether they believe ability is fixed or changeable through effort. By means of an experimental design, they found that entity theorists perceived corrective feedback as less valuable, because they may have considered it an ego-threat from which no helpful information can be inferred. By contrast, incremental theorists valued corrective feedback more and had better feedback outcomes because they may have seen no motivational conflict in corrective feedback and used the information to gear their efforts towards self-improvement (Zingoni & Byron, 2017, p. 53). In all, theoretical and empirical research suggests that interest and value appraisals impact achievement and are also influenceable by means of instruction. The empirical findings however allude to the fact that targeted interventions are necessary to precipitate changes in these appraisals. Hence, interest and value are hypothesized to be reciprocally positively related with formative achievement while keeping in mind that the type of feedback used in this study (numerical feedback without added further relevance explanations and information) might impact the salience of the relationships.
Moreover, as argued above, interest could be more strongly related to achievement than utility value due to its higher extent of intrinsic motivation. Based on their literature review, Wigfield and Cambria infer that interest and value are more strongly related to achievement in contexts of free choice (2010, p. 22). Hence, it could be assumed that both appraisals more likely predict quiz than exam scores, since the former were not attached to any external rewards and were thus less controlling, while the exam was graded.

H3: Formative achievement and statistics-related interest and utility value appraisals reciprocally positively predict each other throughout the semester.
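
The reciprocity posited in H1 to H3 can be illustrated, as a minimal sketch, by a two-variable cross-lagged panel model of the general kind referred to in the studies cited above. The notation is an assumption made for illustration only and is not the model specification of the present study: A denotes formative achievement (e.g., a quiz score), M a motivational appraisal (self-efficacy, difficulty, interest, or value), i the student, and t the measurement occasion.

```latex
\begin{align}
A_{i,t} &= \alpha_A + \beta_{AA}\, A_{i,t-1} + \beta_{AM}\, M_{i,t-1} + \varepsilon^{A}_{i,t}, \\
M_{i,t} &= \alpha_M + \beta_{MM}\, M_{i,t-1} + \beta_{MA}\, A_{i,t-1} + \varepsilon^{M}_{i,t}
\end{align}
```

In this notation, \beta_{AA} and \beta_{MM} are autoregressive (stability) paths, and \beta_{AM} and \beta_{MA} are the cross-lagged paths. H1 and H3 would correspond to \beta_{AM} > 0 and \beta_{MA} > 0, whereas H2 would correspond to \beta_{AM} < 0 and \beta_{MA} < 0.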

3.2.8 Reciprocal Relations between Statistics Affect and Achievement

Statistics-related affect according to the SATS-M model is a rather heterogeneous construct referring to feelings of enjoyment, frustration and anxiety which arise when students cope with statistical information, tasks, or educational settings (Onwuegbuzie & Wilson, 2003). The construct is partly related to the EV dimension of “affective reactions”, which is an antecedent of EV appraisals. Strikingly, the terminology of affective reactions has never been defined in greater detail in the works of Eccles and Wigfield (2002). Statistics-related affect also overlaps with the EV dimension of relative emotional costs as part of the subjective task values such that it refers to the experience of joy, anxiety, and worry when dealing with statistics. The enjoyment part of affect will be omitted in this chapter as enjoyment will be elaborated in more detail in the scope of achievement emotions. The affect construct in the context of the SATS-M model is similar to statistics anxiety, which is one of the most reported emotions of undergraduate students in their first statistics course (González et al., 2016, p. 214; Onwuegbuzie & Wilson, 2003). Statistics anxiety is assumed to be determined interactively by trait anxiety and stress occurring when exposed to statistics coursework. State anxiety is a feeling of affective arousal and worry tied to achievement activities that necessitate high attention or working/short-term memory capabilities (Cassady & Gridley, 2005, p. 5; Eysenck & Calvo, 1992; González et al., 2014, p. 214; Hembree, 1988; Zeidner, 1998). In the following, affect and anxiety will be considered highly related with each other as these appraisals are based on affective reactions, and the two terms are thus used interchangeably.
1a) Theoretical evidence on the appraisal → achievement relation Processing efficiency theory and drive theory assume that state anxiety or negative affective reactions have two antithetical effects which stem from (1) a cognitive interference component of worry and (2) an emotional arousal component (Eysenck & Calvo, 1992, p. 412; Hembree, 1988, p. 33). On the one hand, worry has a self-directed intrusive effect on the available working memory resources (storage and processing capacity), thus initiating task-irrelevant and inattentive processing, procrastination, and the use of shallow learning strategies (Cassady & Gridley, 2005, p. 6; Covington & Omelich, 1987, p. 393; Malespina & Singh, 2022, p. 3; Núñez-Peña et al., 2015, p. 85). Accordingly, statistics anxiety and affect were also documented to impair persistence and the use of self-regulatory or deep learning strategies (González et al., 2016, p. 214). On the other hand, the arousal component involves an activating state of attentiveness

and vigilance to complete a challenge, thus fostering performance-enhancing on-task effort (Eysenck & Calvo, 1992, p. 409; Hembree, 1988, p. 33). The arousal component in academic settings may come into effect when individuals weigh their anxiety about the challenging situation against the potential negative consequences of avoiding or failing the task (Eysenck & Calvo, 1992, p. 413). When these aversive consequences are perceived to be more severe than actually engaging with the challenging situation, individuals will likely continue the task with more effort so as not to risk falling short in terms of performance. Put simply, comparable to difficulty appraisals, anxiety and affective reactions may foster learning to some degree but are likely to impair performance when they abound (González et al., 2016, p. 214). Difficult situations or difficult tasks are assumed to reinforce the debilitating effect of worry on performance. Research therefore devoted particular attention to test anxiety because individuals are eminently prone to worry over evaluation in stressful situations (Eysenck & Calvo, 1992, p. 410). The transactional model of stress postulates that the anxiety or negative affective reactions in evaluative situations result from an anticipation of failure or ego threat which is associated with the environmental factors of the situation (Zeidner, 1998, p. 171). The dependence of the strength of the affect-performance relationship on test-taking conditions goes hand in hand with the blockage hypothesis, which assumes that anxiety and affect do not impair the original learning processes, but the actual test performance phase (Covington & Omelich, 1987, p. 393).
1b) Theoretical evidence on the achievement → appraisal relation

For a better understanding of the variation in anxiety and affective reactions, researchers explored different situational determinants and their influence on test anxiety, following the common understanding of learning as a recursive process (Zeidner, 1998, p. 171). Cassady and Gridley (2005, p. 5) attached the label of learning-testing cycle to this approach, which entails test preparation, performance, and reflection. Test preparation is particularly relevant for the current study because students build up confidence and eliminate unfavorable factors as they become successful (Erzen, 2017, p. 76). This is particularly important to prevent individuals prone to failure from adopting maladaptive avoidance strategies instead of approaching learning (Zeidner, 1998, p. 178). In other words, by means of training, individuals are enabled to eliminate anxiety-inducing uncertainties, thus becoming even more successful. Accordingly, the formative assessments are meant to prompt students to close their extant performance gaps and to adapt to the feared final test situation (Cassady & Gridley, 2005, p. 5). Concerning the reception of failure feedback, individuals are assumed to cope with or neutralize emerging worries either by means of denial or by investing
additional effort and allocating more working memory resources to the tasks in question, for example through rehearsal (Eysenck & Calvo, 1992, p. 416). High-anxious individuals are assumed to be less responsive to formative feedback situations than low-anxious individuals because they are inclined to negatively process negative test experiences (Eysenck & Calvo, 1992, p. 421; Zeidner, 1998, p. 179). Based on processing efficiency theory, the negative impact of worry on working memory resources is expected to drain effort into compensatory anxiety-reducing mechanisms instead of leaving it available for responding to the actual motivational conditions. Following the transactional model of stress, any characteristic of the test environment influences the perceived threat of a test situation and the individual’s appraised anxiety (Zeidner, 1998, p. 171). Formative achievement contexts without immediate aversive consequences might therefore be assumed to have a more beneficial impact on anxiety reduction. Test-enhanced learning assumes that repeated formative, non-threatening opportunities help students practice their test-taking strategies and reduce perceived threats or worries related to inappropriate exam preparation (Erzen, 2017, p. 85; González et al., 2016, p. 220; Khanna, 2015, p. 174; Morris et al., 2021, p. 3; Zeidner, 1998, p. 181). Another aspect that might facilitate the anxiety-reducing impact of formative feedback lies in its computerized administration. Human-machine testing interactions are perceived to be more objective and neutral compared to the proctored conditions of a summative exam (Zeidner, 1998, p. 181). The more accessible, non-threatening, non-competitive, and self-paced conditions of the formative assessment (i.e., no time restrictions, unsupervised processing, open book) enhance perceived control and place fewer demands on the processing and storage resources of working memory (Cassady & Gridley, 2005, p. 6; Pekrun & Linnenbrink-Garcia, 2012, p. 275; Riegel & Evans, 2021, p. 83; Zeidner, 1998, p. 181). These alleviations are in turn assumed to reduce the likelihood of counterproductive emotionality, self-inhibiting ruminations, and avoidance strategies while leaving more room for the positive activating arousal component of anxiety and the allocation of additional resources (Cassady & Gridley, 2005, p. 5). This would eventually also reduce the debilitating effect of anxiety on performance. Mathematics anxiety is the most widely researched domain-specific anxiety and has been meta-analytically documented to be negatively related to math achievement and unalterable by means of classroom interventions (Hembree, 1988; Iossi, 2007). Although math anxiety has been seen as similar to statistics-related anxiety, some researchers point out that statistics tends to be more application-oriented and that anxiety thus impairs cognitive processes differently (Baloglu, 2003; González et al., 2016, p. 215).
2a) Empirical evidence on the appraisal → achievement relation

A predominant body of research suggests that statistics-related positive affect and anxiety rank among the most meaningful predictors of course performance compared to intrinsic and utility value (Erzen, 2017; González et al., 2016; Onwuegbuzie, 2004; Budé, 2007; Chiesi & Primi, 2010; Macher et al., 2012; Macher et al., 2013; Ramirez et al., 2012; Stanisavljevic et al., 2014), suggesting that emotionality plays a central role in statistics knowledge acquisition. By contrast, only a small number of studies in the context of statistics education found no significant relationship between affect/anxiety and performance (e.g., Tempelaar, Gijselaers, et al., 2007). Onwuegbuzie (2004) found that highly test-anxious students who were randomly assigned to a statistics examination under time pressure performed worse than highly test-anxious students in the examination without a time limit. This conforms to the above-elaborated theoretical research related to the blockage hypothesis, which assumes that test conditions moderate the affect-performance relationship. Chew and Dillon (2014) found that regular formative assessments that emphasize conceptual understanding were an effective means to reduce statistics anxiety. Apart from that, no study in the field of statistics education was found that investigated the effects of formative achievement or varying test conditions on anxiety.

2b) Empirical evidence on the achievement → appraisal relation

Opening up the state of research to domains other than statistics, a meta-analysis including approx. 500 studies on causes and treatments of academic test anxiety found item-by-item feedback to be insignificant, whereas at least the high-anxious students benefited from the frequency of testing (Hembree, 1988, p. 65).
Núñez-Peña et al. (2015) investigated two cohorts of psychology students, of which only the second used a formative assessment system with a series of problem-oriented tasks on which students received feedback. No significant correlation between math anxiety and achievement could be found for this cohort, in contrast to the first cohort from the prior academic year, for which anxiety and performance were negatively related (Núñez-Peña et al., 2015, p. 85). The authors interpret the disappearance of the negative relationship as a good sign, suggesting that the formative assessment system reduces the negative impact of anxiety on final exam grades through the successive build-up of mastery experience. Khanna (2015) found that students processing ungraded pop quizzes were significantly less anxious than those who processed graded pop quizzes. Students’ negative affect was, however, only assessed by means of one manifest item. Cassady and Gridley (2005) as well as Cassady et al. (2001) found that formative assessment without evaluative pressure (i.e., ungraded) by means of quizzes in
an ecologically valid setting reduced the impact of anxiety on performance in high-stakes test situations compared to examinations for which no prior practice tests were available (Cassady & Gridley, 2005). Cassady et al. (2001) found that high-anxious students did not use the online quizzes and assumed that this might be due to an alleged lack of sufficient preparation or to an attempt to avoid self-doubt and emotional distress. However, the authors only accounted for quiz usage and not for quiz performance. Covington and Omelich (1987) provided a more detailed insight into these differential effects by administering the same set of items under exam conditions and nonevaluative conditions (retest). They found that, among high-anxious psychology students, only those who had initially reported using effective study strategies profited from the nonevaluative conditions, and only on easy items. Even though the retest was unannounced, a limitation of this study is that it took place one day after the exam, so that memorization effects might well have attenuated the generalizability of the nonevaluative testing condition (Covington & Omelich, 1987, p. 395). In all, the theoretical state of research suggests that high-anxious individuals tend to have a lower performance compared to low-anxious individuals and that the relationship between achievement and anxiety as well as affect is bidirectional (Erzen, 2017, p. 76). The anxiety-reducing impact is expected in particular due to the nonevaluative and non-threatening testing conditions of the formative assessment. In a few quasi-experimental studies among secondary school students, experimental groups participated in regular formative quizzes while the control groups did not. T-tests showed that the anxiety level at the post-test was significantly lower for the experimental group compared to the control group (Moradi et al., 2021).
In all, the theoretical and empirical state of research suggests that affect/anxiety and achievement reciprocally negatively interrelate with each other, whereby the relationships might differ on grounds of the formative and summative assessment conditions.

H4:

Formative achievement and statistics-related affect and anxiety reciprocally positively predict each other throughout the semester.

3.2.9

Reciprocal Relations between Statistics Effort and Achievement

Compared to the other components of subjective task value, the role of relative cost and effort and their influence on academic achievement have been largely under-researched (Jiang et al., 2018, p. 140; Perez et al., 2019). In particular, effort was often only investigated descriptively by comparing or delineating participation rates or by anecdotal reports (e.g., Chans & Castro, 2021; Evans et al., 2021; Sancho-Vinuesa et al., 2013). The lack of research may partly stem from the fact that student effort has been viewed as a multidimensional meta-construct without a consistent rationale (Fredricks & McColskey, 2012, p. 764; Henrie et al., 2015, p. 37). The myriad of conceptualizations must therefore be related to the context of this study before elaborating on potential interrelations with achievement. Generally, student engagement is subdivided into three components that refer to the main agents accompanying the sequences of action (Fredricks & McColskey, 2012, p. 764; Huang et al., 2018, p. 1109; Muir et al., 2019, p. 263). Behavioral engagement entails observable time on task, class attendance, completion rates, and any other concrete involvement in academic activities. Cognitive engagement refers to the less observable psychological willingness to invest effort and to use self-regulatory, i.e., volitional, learning strategies to undertake academic activities. Emotional engagement addresses the emotional byproducts of learning, such as interest and value, which are already covered by the previously elaborated constructs (Fredricks & McColskey, 2012, p. 764; Henrie et al., 2015, p. 37; Huang et al., 2013, p. 1109; Muir et al., 2019, p. 263). The present study focuses on self-reported cognitive engagement to account for students’ subjective perspective on the mental effort they expend.
Given the focus of this study on attitudinal and emotional perceptions, behavioral engagement, usually measured by means of objective attendance sheets, participation rates, or time recordings, will not be further considered.

On the unsettled position of effort within the EV model

Within the EV model, effort, as part of relative cost, was considered a part of subjective task value. Yet another reason for the lack of research is that student effort was consequently neglected in favor of the other value facets (i.e., attainment, utility, intrinsic value) until recently (Jiang et al., 2018, p. 140). Relative costs in the EV model refer to the mostly negatively connoted consequences of engaging in certain activities (Gaspard et al., 2015, p. 1227; Harackiewicz, Canning, et al., 2016). Wigfield and Eccles distinguish between emotional costs (negative emotional consequences and ego threats related to engagement in an
activity), opportunity costs (perceptions of forgone and more valuable alternative activities), and effort costs (perceptions of the effort required for task execution; Eccles & Wigfield, 2002; Jiang et al., 2018, p. 140; Rosenzweig et al., 2020, p. 167). Following the prioritization of the SATS-M, the present study primarily focuses on effort costs. This leads to another conceptual ambiguity concerning the positioning of the effort construct within the theoretical model. While the EV perspective subsumes effort cost under the umbrella of subjective task value (Wigfield & Eccles, 2002), the SATS-M, in accordance with Ajzen’s theory of planned behavior (1991), assumes effort to be a consequence of EV appraisals and a precursor of academic achievement (Ramirez et al., 2012). This divergence in the role of effort cost reflects a debate on which research has not yet reached a consensus (Jiang et al., 2018, p. 140). The original rationale for subsuming effort cost under the subjective task values was the understanding of a cost/value ratio in which cost differs from the other three value facets (i.e., attainment, intrinsic, utility value; Jiang et al., 2018, p. 140) only in terms of its negative valence. Precisely this difference in valence might, however, give rise to entirely different effect mechanisms, such that cost predicts avoidance-related rather than approach-related motivational patterns. In order to avoid inaccurate predictions of achievement behaviors due to a conflation of cost with the other facets of subjective task value, some researchers suggest treating cost as a separate construct (Jiang et al., 2018, p. 140). Based on these assumptions, the conceptualization within the statistics-related SATS-M will be adopted for the present study, in which effort has a more central role as a separate construct and thus partly incorporates the above-described mechanisms of the feedback intervention model (Kluger & DeNisi, 1996).
Concretely, effort in the present study is deemed a willingness to achieve, persist, and engage in learning that is influenced by EV appraisals and in turn influences academic achievement (Ramirez et al., 2012). Effort is thus construed as a concrete manifestation of EV-related motivation, which energizes learners to purposefully direct their effort towards goal achievement (Chans & Castro, 2021, p. 3).

1a) & b) Theoretical findings on the appraisal ↔ achievement relation

Regarding the interrelations of effort and achievement, formative assessment approaches are generally believed to be more easily reconcilable with the concept of engagement, as both construe learning as an ongoing process (Nichols & Dawson, 2012, p. 465). This view goes hand in hand with the recurring rationales of mastery goal orientation, incremental theory, and task-involving and informational aspects that have been shown to be beneficial criteria of formative feedback throughout this chapter. For instance, the above-mentioned distinction between informational and controlling aspects of evaluations, stemming from cognitive
evaluation theory, accounts for differential perceptions of formative and summative assessments regarding achievement-related behavior. While summative assessments give ego-involving, mostly normative information on the current knowledge level, informational, task-involving feedback is expected to propel motivation by addressing mastery goal rather than performance goal orientation (Nichols & Dawson, 2012, p. 464). Hence, formative assessment approaches are eminently suitable for fostering engagement because of their repeated occurrence and their emphasis on self-reflection during the ongoing learning process (Nichols & Dawson, 2012, p. 466). The already mentioned attribution theory plays a major role in how students adapt their effort based on previous performance or feedback (Schunk, 1989, p. 179). Because effort is a less stable construct than ability, it is more easily adapted to past mastery experiences (Schunk, 1989, p. 186). Mastery experience, constituting an important cue for ability attributions for incremental theorists, can thereby be assumed to result in higher persistence because students might realize that their previously invested effort pays off, leading to the means-end decision to invest additional effort to produce further success. More concretely, formative feedback that affirms one’s learning process is assumed to have an amplifying impact on persistence because it confirms the learners’ implicit intelligence belief that there is a causal relation between their invested effort and their achieved performance (Fulton & Fulton, 2020, p. 78; Schunk, 1989, p. 180). This in turn propels the endeavor to further accumulate mastery experiences. In all, the theoretical state of research leads to the assumption that students with higher levels of effort are more successful and that past success in turn leads to higher effort.
Kluger and DeNisi’s feedback intervention theory provides a framework for the effect mechanisms of cognitive effort as a consequence of a feedback-standard discrepancy (1996, p. 264). When the comparison of the feedback with the intended standard shows that the standard has been surpassed, the individual may decide either to reduce effort or to further increase effort in order to achieve even higher standards. When the discrepancy is negative, i.e., the learners fall short of the expected standards, they are assumed to increase effort or, if they lack self-efficacy, to shift their attention to the self (Kluger & DeNisi, 1996, p. 260). The theory emphasizes that feedback in relation to the intended goals is needed to be able to purposefully gear one’s effort to the goal demands (Locke & Latham, 2002, p. 708). Due to its strong cybernetic conceptualization of the human being as an adaptive system, the model, however, neglects motivational mechanisms, such as discouragement or non-acceptance of given standards, which might undermine the mobilization of effort in the face of failure. Therefore, effort should also
be considered along with its interrelations with other motivational constructs as suggested in the EV model.

2a) Empirical findings on the appraisal → achievement relation

In the statistics domain, empirical findings on the relation of effort with achievement are mixed. Tempelaar, van der Loeff, et al. (2007) and Tempelaar and van der Loeff (2011) found that effort was highly related to weekly quiz scores, but not to the final summative exam. The differential functioning of effort was assumed to stem from the dominating achievement orientation (instead of a learning approach orientation) due to the exam bonus points that could be acquired by participating in the quizzes (Tempelaar, van der Loeff, et al., 2007, p. 96). Even though no such bonus points are achievable in the present study, this rationale remains relevant because, apart from the explicit reward, the quiz processing conditions are similarly accessible and flexible. More concretely, the uncontrolled, open-book assessment without time restrictions rather addresses the motivational component of planned effort; diligent students will likely receive a better score because they are allowed to spend more time on processing or looking up their course materials without having to recall the information. While researching the solution methods, the student, however, still accumulates and retrieves information, which likely contributes to formative knowledge acquisition (Evans & Culp, 2015, p. 88). By contrast, the exam is less accessible and more controlling than the quiz, which might lead to a domination of the learning approach component (Tempelaar, van der Loeff, et al., 2007, p. 97). This implies that high previous expenditure of effort might not be related to a higher exam score because other factors such as nervousness and time pressure come into play, so that the exam questions cannot be worked on in a similarly deliberate and focused way as the quiz questions.
In addition to this, Tempelaar and van der Loeff (2011) found that the effort construct as conceptualized by the SATS-M had a positive correlation with surface learning (i.e., memorizing, rehearsing) and a negative correlation with deep learning strategies (critical thinking, relating, structuring). Moreover, they found summative exam performance to be positively related to deep learning approaches. Combining these two findings, they assumed that the predominant insignificance of effort is likely due to the surface learning approach resonating with the construct, thus rendering effortful learning ineffective for performance on summative exams and obscuring the hypothesized EV achievement mechanisms. Taken together, all these findings lead to the assumption that formative achievement is more closely linked with effort than summative achievement. Based on
a literature review, Nichols and Dawson (2012) conclude that formative evaluations tend to interrelate more closely with engagement and self-determination than summative evaluation systems. Budé et al. (2007) used tutor observations on a ten-point rating scale, parallel to self-reports, to measure student effort and persistence in an introductory statistics course. Effort ratings accounted for lecture attendance, preparedness, study time, and active participation in group discussions, while persistence accounted for “the extra mile” of consulting teachers and course materials in the face of obstacles. Only the observational data on student persistence correlated significantly with exam performance (Budé et al., 2007, p. 12), suggesting that engagement has to be measured on grounds of higher-order thinking levels, volitional learning strategies, or mastery goal orientations to be relevant for subsequent summative performance. Since the effort conceptualization of the present study does not account for persistence, but for the mere willingness to achieve, it might also turn out to be unrelated to summative performance. Other studies, such as Stanisavljevic et al. (2014, p. 2), found significant relations between effort and statistics achievement, which was measured by means of a self-constructed multiple-choice test.

2b) Empirical evidence on the achievement → appraisal relation

Concerning the effect of feedback on effort, only the study of Tempelaar and van der Loeff (2011) could be found in the domain of statistics education, documenting that the quiz scores from an e-tutorial were a meaningful predictor of post-level effort. The authors, however, do not elaborate on this feedback effect apart from acknowledging the benefit of detailed preparation (Tempelaar & van der Loeff, 2011, p. 2814).
Leaving the domain of statistics education, a study in which students engaged in an online computer lab consisting of simulations, frequent quizzes, and feedback found that effort (measured by the time spent in the lab) was related to higher achievement (Fulton & Fulton, 2020). Evans et al. (2021) also implemented pre-lecture quizzes to enhance engagement. Despite the high number of 32 quizzes (three quizzes per week), they found a consistently positive impact on engagement throughout the semester. Given that quiz participation contributed only .25% to the final grade, no rigorous incentives to foster quiz participation seem necessary to channel its beneficial effect on subsequent effort (Evans et al., 2021, p. 169). In another study with weekly assessments and automated feedback, the researchers found that engagement in terms of completion rates was high (about 80%) and that only 2% of the highly engaged students dropped out of the mathematics course (Sancho-Vinuesa et al., 2013, p. 62). Subsequent qualitative interviews suggested that students
perceived the prescribed regularity of the weekly assessments as helpful and necessary to avoid procrastination (Sancho-Vinuesa et al., 2013, p. 63). Jiang et al. (2018) as well as Perez et al. (2019) found that effort cost negatively predicted achievement among secondary school students. In these studies, however, the operationalization was negatively connoted. For instance, the items refer to mental overload or insecurities about whether the invested effort will actually pay off or is worthwhile, which does not match the conceptualization of the present study as “willingness to achieve” (see above). Despite the partly inconsistent empirical state of research, studies on formative assessments commonly found positive interrelations with effort. The present study thus formulates an analogous hypothesis, keeping in mind that different testing conditions (i.e., formative quiz versus summative exam) might contribute to varying effect mechanisms in the effort-performance relationship.

H5:

Formative achievement and statistics-related effort reciprocally positively predict each other throughout the semester.
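
The reciprocal prediction stated in this hypothesis can be written as a pair of cross-lagged regressions. The following is only an illustrative sketch under assumptions that are not taken from this section: the symbols A_t (formative achievement at measurement occasion t) and E_t (statistics-related effort at occasion t), the coefficient names, and the choice of a first-order cross-lagged specification are all hypothetical and do not necessarily correspond to the analytical model actually used in the study.

```latex
% Hypothetical first-order cross-lagged sketch (notation assumed, not from the source)
\begin{align}
  E_{t+1} &= \alpha_E + \beta_{EE}\,E_t + \beta_{EA}\,A_t + \varepsilon_{E,t+1},\\
  A_{t+1} &= \alpha_A + \beta_{AA}\,A_t + \beta_{AE}\,E_t + \varepsilon_{A,t+1}.
\end{align}
```

Under this reading, a reciprocal positive relation would correspond to both cross-lagged coefficients being positive, i.e., \beta_{EA} > 0 (earlier achievement predicts later effort) and \beta_{AE} > 0 (earlier effort predicts later achievement), over and above the autoregressive paths \beta_{EE} and \beta_{AA}.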

3.3

Formative Feedback and Achievement Emotions

3.3.1

Broadening the Conceptualization of Statistical Reasoning to Encompass Achievement Emotions in the Uptake of Feedback

The reception of feedback in achievement contexts is emotionally laden and goes beyond mere factual information processing (Lipnevich et al., 2021; Seifried, 2003, p. 207; Seifried & Sembill, 2005, p. 656). The aim of this study, to account for learning as a holistic process, necessitates an integrative consideration of motivational, emotional, and cognitive appraisals. Apart from EV appraisals and omnibus constructs of affect, achievement emotions were neglected for a long time in research (Parr et al., 2019, p. 1). Most of the feedback models depicted in chapter parenthetically refer to emotional reactions, e.g., “affect” (Carver & Scheier, 2000), “emotional reactions” (Zimmerman, 2000), or “affective reactions” (Eccles & Wigfield, 2002). However, the model constructors do not specify discrete emotional facets and their differential functioning in relation to achievement. Achievement emotions came to the fore in nascent research
in the past 20 years (Camacho-Morles et al., 2021; Fong et al., 2018, p. 238; Seifried, 2003, p. 207). Exploratory studies in university contexts revealed that students experience a broad range of emotions, which were shown to impact academic achievement, deep learning, and the sense of wellbeing (Pekrun & Linnenbrink-Garcia, 2012, p. 260; Pekrun, 2018, p. 216). However, according to present knowledge and in contrast to EV appraisals, achievement emotions have rarely been considered in the context of statistics education. Moreover, even though the theoretical model of achievement emotions denotes feedback as a key element of emotion formation, the reciprocal and longitudinal interrelations are largely under-researched empirically, particularly in large lecture courses (Lipnevich et al., 2021; Riegel & Evans, 2021, p. 77; Peixoto et al., 2017, p. 386; Pekrun et al., 2014, p. 115; Peterson et al., 2015, p. 85; Seifried & Sembill, 2005, p. 656). Hence, knowledge is limited on how students emotionally respond to computer-mediated feedback and on how emotional processing fosters knowledge acquisition (Jarrell et al., 2017, p. 1263). Likewise, little is known about the changeability of achievement emotions around formative and summative assessments. In what follows, the above-elaborated theoretical model will be extended by an emotional component. First, Pekrun’s CV theory of achievement emotions (2018, p. 217) will serve as a sound basis to theoretically explain the assumed relationships between feedback and emotional as well as motivational appraisals.

3.3.2

Motivational and Emotional Uptake of Feedback According to the Control-Value Theory of Achievement Emotions

Emotions are considered psychologically driven processes consisting of interwoven physiological, affective, cognitive, and motivational components that originate from the limbic system (Pekrun & Linnenbrink-Garcia, 2012, p. 260; Pekrun, 2006, p. 316). They can vary in terms of intensity, whereby less intense emotions are considered moods (Pekrun, 2006, p. 316). Achievement emotions are related to studying activities and their outcomes evaluated by preset standards (Pekrun, 2018, p. 218).¹⁰ In the context of this study, achievement emotions are considered state-like because they are assessed over a limited time span of one semester in relation to a specific subject, whereas trait emotions span longer periods and refer to more general cross-domain feelings (Peixoto et al., 2017, p. 386). Based on exploratory research, Pekrun et al. (2002) determined relevant emotions among higher education students, such as boredom, hope, anger, pride, enjoyment, and hopelessness (Pekrun & Linnenbrink-Garcia, 2012, p. 260; see section 5.3.3). These emotions can be categorized according to their valence (positive vs. negative or pleasant vs. unpleasant), activation (activating vs. deactivating or energizing vs. relaxing), and object focus or time reference¹¹ (ongoing activities vs. prospective or retrospective outcomes), yielding a 3 × 2 taxonomy (Pekrun, 2018, p. 216; Peixoto et al., 2017, p. 386; see Table 3.2).

¹⁰ Other emotional categories such as epistemic (cognitive reactions to task information, such as surprise), topical (related to the content of the learning material), and social (related to other persons) will not be addressed as they are considered less relevant for feedback reception processes.

Table 3.2 Exemplary Overview of a Variety of Achievement Emotions according to the 3 × 2 taxonomy

Object focus            Positive                     Negative
                        Activating   Deactivating    Activating   Deactivating
Activity                Enjoyment    Relaxation      Anger        Boredom
Outcome-Prospective     Hope         Relief          Anxiety      Hopelessness
Outcome-Retrospective   Pride        Contentment     Shame        Sadness

An inclusion of all achievement emotion constructs would result in overly complex analytical models. This is aggravated by the fact that several constructs (such as hope and enjoyment) were shown to be highly and inseparably intercorrelated in a few validation studies (Peixoto et al., 2015; Pekrun et al., 2004). Hence, enjoyment as a highly controllable, pleasant, positive, activating emotion and hopelessness as a less controllable, unpleasant, negative, deactivating emotion were selected to balance components of opposite valence, activation, and level of control (Gómez et al., 2020; Pekrun et al., 2011; Pekrun & Linnenbrink-Garcia, 2012). Both emotions will be assessed in in-class/course-related and out-of-class/learning-related contexts. The two achievement contexts are meant to account for the different structures of academic situations, which might differentially affect achievement emotions (Pekrun et al., 2002, p. 95). Enjoyment and hopelessness are also considered relevant in depicting emotional trajectories because, unlike most other retrospective and prospective emotions (pride, shame, relief), they are not terminative, so that they can be assumed to meaningfully relate to preceding and succeeding feedback. Enjoyment in particular is included in the analyses since prior research overemphasized the impact of negative emotions while neglecting beneficial emotional impacts on motivation and achievement (Gómez et al., 2020; Pekrun et al., 2014). The CV theory embeds these achievement emotions in a comprehensive framework representing learners’ adaptation to educational, instructional, and task-related learning characteristics driven by AME appraisals (Pekrun & Linnenbrink-Garcia, 2012, p. 271; Ranellucci et al., 2021). The CV theory integrates the above-elaborated EV and attributional theories as well as the transactional stress model in a broader way to holistically account for the complex interrelations of learning and classroom reality (Pekrun et al., 2017, p. 1654; Peterson et al., 2015, p. 82). The integrative view is based on the assumption that emotion and motivation are inseparable, as motivated behavior eventually targets the current emotional state, and vice versa (Seifried & Sembill, 2005, p. 658). Figure 3.6 first depicts the complete CV theory model, which is subsequently particularized to the context of the present study.

¹¹ Prospective, concurrent, and retrospective time references of achievement emotions are less relevant in the present study. On grounds of the longitudinal design, for instance, hopelessness can be seen as retrospective when it is adjusted according to previous feedback information, or prospective when the student approaches upcoming coursework. Hence, emotions are considered both as reactions to past achievement and signals for the individual about what to do next (Fong et al., 2018, p. 239).

[Figure 3.6 (diagram not reproduced). Elements: (Learning) Environment: individual characteristics; instructional and motivational quality; autonomy, competence, relatedness; achievement consequences and feedback. Motivational appraisals: controllability of task demands and learning activities; subjective task value; implicit theories; goal orientation. Emotional appraisals: achievement emotions; epistemic emotions; topic emotions; social emotions. Achievement behavior: engagement and effort; performance. The elements are connected by a feedback loop.]

Figure 3.6 Control-Value Model of Achievement Emotions. (Note. Source: Author’s own based on Pekrun (2006))

Based on the transactional stress model, CV theory postulates that the appraisal of the learning environment is a crucial determinant of motivational and emotional manifestations. Emotions are moreover conceptualized as the result of cognitive evaluation processes (Conrad, 2020, p. 41). As proximal determinants, the self-perceived controllability and subjective value of achievement-related activities are assumed to give rise to a variety of achievement emotions, which in turn causally determine learning behavior and learning outcomes (Peixoto et al., 2017,
p. 386; Pekrun, 2018, p. 228). Appraisals of control stem from a perceived sense of malleability of course actions and outcomes as well as from the estimated likelihood that certain actions produce certain outcomes in achievement-related cause-effect relationships (Fong et al., 2018, p. 238; Wigfield & Cambria, 2010, p. 5). Control is related to self-efficacy while value appraisals coincide with the subjective task value in the EV model (i.e., interest, utility, extrinsic value; Peixoto et al., 2017, p. 386). CV appraisals are assumed to instigate enjoyment or hopelessness (Pekrun & Linnenbrink-Garcia, 2012, p. 272). For instance, feeling competent and in control of mastering the coursework as well as deeming the activities intrinsically relevant for the future occupation is expected to foster enjoyment. By contrast, hopelessness is fostered when students feel disinterested and perceive a noncontingency between their actions and the resulting outcomes, implying an inability to control the situation (Atiq & Loui, 2022, p. 4; Curelaru & Diac, 2022, p. 54; Pekrun, 2007, p. 589; Pekrun & Stephens, 2010, p. 283).¹² Low perceived control over potential causes for low performance might result in negative affective states such as learned helplessness (Gist & Mitchell, 1992, p. 196). Pekrun and Linnenbrink-Garcia (2012, p. 260) also address the role of effort within CV theory, considering achievement emotions its precursor, while effort in turn impacts academic achievement (Peterson et al., 2015, p. 83; Tempelaar et al., 2012, p. 163). This conforms to the view of effort being the catalyst of achievement behavior (see section 3.2.4) and has also been substantiated empirically (Pekrun & Linnenbrink-Garcia, 2012).
Hopelessness as a negative deactivating emotion is assumed to undermine engagement by setting the organism in a state of passivity, while enjoyment initiates and mobilizes impulses for further enlargement of one’s learning experience, flexible and self-regulatory learning strategies, cognitive elaboration, and other goal-directed actions (Pekrun & Linnenbrink-Garcia, 2012, p. 266). Comparable to the EV theory, the CV theory also considers learning processes to be subject to feedback loops over time in line with dynamic systems theory, so that learners experience positive or negative achievement emotions based on their prior success or failure. These emotions in turn enhance or diminish motivation to do well on the next task (Atiq & Loui, 2022, p. 4; Fong et al., 2018, p. 238; Jarrell et al., 2017, p. 1274; Lipnevich et al., 2021). These recursive cycles initiate self-regulatory strategies as students adapt their subsequent appraisals contingent on their prior formative achievement.

¹² The relationships between control-value appraisals and achievement emotions have been empirically corroborated, e.g., in Buff (2014), Clem et al. (2021), Niculescu et al. (2016), and Peixoto et al. (2017) by means of latent SEM and latent-change models.

Pekrun et al. (2002) also noted that emotions following an evaluation are particularly intense. Drawing on the conceptualization of “emotional experience” (Seifried & Sembill, 2005, p. 658), emotions function as inner seismographs which evaluate the success of need satisfaction. Grounded in the attributional theory and SRL (Zimmerman, 2000), the emotional appraisal of performance-related experiences is then again contingent on their perceived controllability and value (Conrad, 2020, p. 40; Dettmers et al., 2011, p. 26; Pekrun et al., 2017, p. 1654). Accumulating success (or failure) experiences, and their anticipated consequences, foster (or undermine) students’ sense of control over subsequent outcomes, so that learning environments relying on regular formative feedback are expected to influence achievement emotions (Lipnevich et al., 2021). The value of achievement is usually addressed when students deliberate about the consequences of their prior success or failure (i.e., future career chances; Conrad, 2020, p. 47; Curelaru & Diac, 2022, p. 54; Pekrun, 2018, p. 228; Wortha et al., 2019, p. 3). Positive feedback may be attributed to success and thus prompts students to keep on engaging in the subject, and higher task involvement may then again foster enjoyment in the next feedback loop. Therefore, the CV framework is considered a model of reciprocal causation in which achievement outcomes are reciprocally related with the external environment and AME appraisals over time (Pekrun & Linnenbrink-Garcia, 2012, p. 277; Peixoto et al., 2017, p. 386).¹³ The reciprocal causality draws on the assumption of social cognitive theory that, on the one hand, the learning environment influences self-regulatory strategies while, on the other hand, individuals regulate the internalization of these external stimuli.
CV theory provides a meaningful framework to analyze to what extent instructional parameters, such as formative feedback, lead to different AME appraisals to support learning processes after solving performance tasks (Jarrell et al., 2017, p. 1263). Most importantly, the CV theory transforms the EV theory from a problem-focused model (i.e., identifying difficulties in statistics learning on grounds of attitudes) to a solution-focused model. By explicitly including the learning environment as a manipulable element that instructors can use to further influence the appraisal-achievement relationship, the integrated model serves as a theoretical basis to address the lack of findings on instructional influencing factors of AME appraisals (see section 1.3). Figure 3.7 particularizes the above-depicted CV theory to the present study context.

¹³ CV theory assumes that the learning environment is itself part of the reciprocal causation, for instance, through influences on the behavior of teachers by means of emotional contagion. This reciprocal link is neglected in this study due to the large lecture setting. Hence, it is considered that achievement outcomes feed back directly to cognitive appraisals.

[Figure 3.7 (diagram not reproduced). Elements: (Learning) Environment: individual characteristics (gender, prior knowledge); instructional characteristics (traditional, flipped); formative feedback. Motivational appraisals: control (self-efficacy, difficulty); value (interest, utility value, affect). Emotional appraisals: achievement emotions (enjoyment, hopelessness; in-class and out-of-class). Achievement behavior: effort; performance. The elements are connected by a feedback loop.]

Figure 3.7 Feedback-Related Achievement Motivation and Emotion (FRAME) Processing Model. (Note. Source: Author’s own based on Pekrun (2006))

Drawing on the complete CV theory framework, the study focuses on CV appraisals, enjoyment, hopelessness, as well as effort and course performance. Regarding the external factors, feedback is considered the driving force of motivational and emotional trajectories. Before this model excerpt is integrated with the theoretical model elaborated in section 3.2.3, empirical evidence on the relationship between enjoyment, hopelessness, feedback, and academic achievement will be collated for further hypothesis generation.

3.3.3 Reciprocal Relations between Enjoyment, Hopelessness, and Achievement

Helmke et al. aptly describe enjoyment as an “affective coloring of learning” triggered by certain learning situations (2007, p. 18). Enjoyment is considered an affective state of pleasure, emotional cathexis, and flow. Flow represents a conscious state of individuals when they find a task intrinsically enjoyable and become absorbed in the learning activity (Goetz et al., 2012, p. 226). Feeling-related valence in this vein differs from interest and utility value, which are considered value-related valences. While value-related valences are determined by personal importance, feeling-related valences arise from direct involvement and stimuli from the learning environment (Conrad, 2020, p. 49; Wigfield & Cambria, 2010, p. 9). Hopelessness is an emotion that may occur while feeling confused about complex tasks that cannot be resolved immediately and entails feelings of deactivation, irritation, and failure, likely resulting in disengagement (Arguel et al., 2019, p. 204).

1) Theoretical evidence on the appraisal → achievement relation

According to the resource allocation theory, emotions regulate cognitive resources of the working memory by directing the attention to the object of emotion. The object focus being the learning activity itself (see section 3.3.2), enjoyment paves the way for the full use of cognitive resources, creative thinking, and metacognitive learning strategies, which directs the attention towards task completion (Conrad, 2020, p. 41; Peixoto et al., 2017, p. 387; Pekrun & Stephens, 2010, p. 271). On the contrary, feeling hopeless is considered maladaptive towards learning, because this feeling shifts attention away from the task to anticipated consequences of failure, such as criticism at the self-level (Evans, 2013, p. 96; Lipnevich et al., 2021; Pekrun & Linnenbrink-Garcia, 2012, p. 264). Comparable to the interference model mentioned in section 3.2.4, such task-irrelevant cognitive processes reduce cognitive resources, disrupt efficient information processing of subject matter, and impair performance (Jacob et al., 2019, p. 1770; Peixoto et al., 2017, p. 387; Pekrun et al., 2002, p. 98). The theory of mood-congruent retrieval assumes that emotions affect storage, retrieval, and processing of information. Assuming like-valenced processing of information, students who approach tasks with higher levels of enjoyment are more likely to be attentive to and recall positive task-related information while higher levels of hopelessness promote the retrieval of negative, task-irrelevant information, leading to avoidance (Pekrun et al., 2002, p. 97). Boekaerts and Cascallar’s (2006) dual processing model takes the same line by hypothesizing that dispositions determine the task-related coping mechanism.
If learners approach a task with negative emotional predispositions, or when they find an incongruence between personal goals and task goals, they tend to pursue a bottom-up, well-being pathway, i.e., an avoidance strategy to protect the self from further burdens (Panadero, 2017, p. 6). When the task conforms with the psychological state of mind in such a way that task completion is driven by learners’ positive predispositions, they will likely strive for an amplification of their competence on the top-down mastery pathway (Panadero, 2017, p. 20). Enjoyment and hopelessness were also found to impact attentiveness and motivation, and thus could also be assumed to indirectly influence feedback receptivity (Jarrell et al., 2017, p. 1266). These lines of theoretical research suggest that positive emotional appraisals likely facilitate a beneficial processing of learning content, thus promoting performance, while the opposite would be true for negative emotions.

2a) Empirical evidence on the appraisal → achievement relationship

Concerning the interrelations between emotions and academic achievement, enjoyment was generally found to positively relate to performance, with
the inverse pattern for hopelessness (Dettmers et al., 2011; Frenzel, Thrash, et al., 2007, p. 306; Gómez et al., 2020; Jarrell et al., 2017; Lichtenfeld et al., 2012; Peixoto et al., 2017, p. 387; Pekrun et al., 2011, p. 45; Pekrun & Linnenbrink-Garcia, 2012; Pekrun & Stephens, 2010, p. 271). Pekrun et al. (2017) confirmed reciprocal relationships between enjoyment, hopelessness, and achievement by means of cross-lagged SEM with a sample of about 3,000 middle school students (Grades 5–9). In a meta-analysis of activity-related emotions, Camacho-Morles et al. (2021) confirmed a moderately positive relationship between enjoyment and achievement (based on 57 study samples). A further moderator analysis in their meta-analysis also revealed that the relationships were stronger for secondary school students than for higher education and primary school students. In an overview of ten empirical studies (three longitudinal and seven cross-sectional), Pekrun et al. (2012) found that enjoyment positively related to intrinsic motivation, effort, and academic achievement whereas hopelessness correlated negatively. Using path modeling, Tempelaar et al. (2012) found that enjoyment positively predicted quantitative participation in mathematics and statistics practice quizzes of first-year business and economics students. Generally, the findings on the emotion-performance relationship corroborate the above-elaborated theoretical state of research. In some studies, however, only hopelessness, but not enjoyment, was found to be a significant predictor of achievement (Curelaru & Diac, 2022; Peixoto et al., 2017; Putwain, Sander, et al., 2013; Tempelaar et al., 2020; Wortha et al., 2019). Moreover, hopelessness, being a negative emotion, was occasionally reported only in low intensities (Jarrell et al., 2017; Pekrun et al., 2002, p. 93).
The smaller range might result in limited statistical variance, which might not suffice to differentiate meaningfully between students with varying levels of hopelessness. Because of the above-mentioned occasional non-significant results and the limited range of emotional appraisals, researchers began to examine meaningful, homogeneous emotional profiles by means of person-centered, intra-individual cluster analyses and related them to performance. A majority of these studies found a positive, a neutral, and a negative emotion cluster, whereby membership in the positive emotion cluster was associated with higher performance, the neutral cluster with mediocre performance, and the negative cluster with lower performance (Jarrell et al., 2017; Wortha et al., 2019). In all, these findings also suggest that enjoyment positively relates to academic achievement while hopelessness relates negatively.

2b) Empirical evidence on the achievement → appraisal relationship

The impact of feedback on achievement emotion appraisals was investigated in a limited number of empirical studies that found small to moderate effect sizes
between emotions and performance (Lipnevich et al., 2021). Pekrun et al. (2014) compared the emotion-enhancing effects of anticipated self-referential feedback (i.e., comparing a student’s performance over time) with those of normative feedback (i.e., comparing a student’s performance relative to other students) by means of univariate ANOVAs. Anticipation of self-referential feedback was related to higher subsequent test enjoyment while normative feedback related to higher hopelessness (Pekrun et al., 2014, p. 122). However, the study used only the prior announcement of the respective types of feedback as independent variables and not the actual performance in these tests. Since feedback in the present study consists of self-referential rather than normative information, it could at least be assumed that feedback reception does not have adverse effects on subsequent emotions. Clark and Svinicki (2015) investigated the impact of retrieval (i.e., 10 short questions followed by feedback and reprocessing of the same questions in a follow-up session) compared to restudy methods (reprocessing without receiving feedback). A within-subject repeated measures ANCOVA showed that participants receiving positive feedback in the retrieval condition enjoyed learning significantly more than the restudy group. By use of mean comparisons, Riegel and Evans (2021) found that students experienced more enjoyment and less hopelessness when processing between-lecture online quizzes compared to traditional, controlled exams. Subsequent qualitative analyses suggested that students preferred the more manageable and controllable conditions of the online quiz (i.e., open book, looser time restrictions), which conforms to the empirical findings on anxiety and formative achievement (see section 3.2.4). However, the researchers did not assess CV appraisals as another important influencing factor. Rüth et al.
(2021) compared the reception of correct-result feedback with that of elaborated feedback in quiz apps in two experimental studies by means of ANOVA and found that students perceived the quizzes as equally enjoyable in both conditions. Fong et al. (2018) assessed emotions by means of open-ended questions following constructive or negative feedback on a writing task. Upon receiving positive feedback, students indicated higher enjoyment because it provided a path towards improvement and was construed as corroborative praise. By contrast, upon receiving negative feedback, students frequently reported feeling hopeless because of a lack of efficacy, or they adopted a defeatist stance as they felt they had fallen short of the course expectations (Fong et al., 2018, p. 248). Curelaru and Diac (2022) compared emotional appraisals depending on the perceived learning- or performance-orientation of an assessment environment among undergraduate Language and Literature students. Mastery-approach orientation negatively
related to hopelessness while performance-orientation related negatively to enjoyment and positively to hopelessness. These relationships between emotions and achievement goals were corroborated in several other studies (e.g., Goetz et al., 2016; Pahljina-Reinić & Vehovec, 2017; Putwain, Larkin, et al., 2013). They also align with the findings from section 3.2.4 such that students using a mastery and incremental approach towards learning tend to process assessments in more favorable ways compared to entity theorists relying on performance approaches. All of the above-mentioned findings tentatively suggest a positive relationship between feedback reception and enjoyment and a negative relationship with hopelessness. However, despite feedback being declared a “primary force of shaping students’ emotions” (Pekrun & Linnenbrink-Garcia, 2012, p. 12) in the scope of CV theory, empirical findings are meager. Broadening the scope of research to more elaborated and visualized provision of feedback suggests that the impact of feedback on emotions depends strongly on the delivery method (Jarrell et al., 2017, p. 1266). Empirical studies suggest, for instance, that the feedback-enjoyment relationship is reinforced when feedback is conveyed by virtual agents in game-based learning¹⁴ (e.g., Chan et al., 2021; Crocco et al., 2016; Guo & Goh, 2016; Wang et al., 2016; Zainuddin et al., 2019). The present study therefore aims to provide further insight into whether feedback conveyed in sparser ways (i.e., neutrally presented numeric scores in a learning management system) is able to emotionally appeal to students.

3.3.4 Differential Perceptions of Achievement Emotions in In-class and Out-of-class Learning Contexts

A final nuance for hypothesis generation resides in the distinction between course- and learning-related achievement emotions. Despite the findings of confirmatory factor analyses that these two emotional contexts serve to account for different prevalent learning environments influencing the interplay between emotions and achievement (Pekrun et al., 2011), there has never been a systematic elaboration on the characteristics of this conceptual differentiation (Goetz et al., 2012, p. 225; Ranellucci et al., 2021, p. 2), let alone empirical findings. Most of the above-cited research accordingly focused on course emotions. From a theoretical perspective, out-of-class learning could be principally based on a higher degree of autonomy because the learner studies at an individual, self-regulated pace independent from the instructor’s directives.

¹⁴ Virtual agents in digital learning environments are embodied learning companions, e.g., avatars, to increase the affective appeal of instruction.

Classroom learning, by contrast and particularly in traditional lectures, is more narrowly structured and externally regulated by instructors (Goetz et al., 2012, p. 226). Moreover, since the reception of quiz feedback takes place outside the classroom, it could be a more salient source of out-of-class emotion regulation. On grounds of the allegedly higher sense of control and flexibility outside the class regarding what and when to learn, a vague supposition could be that the feedback effect operates more strongly in relation to learning emotions (Ranellucci et al., 2021, p. 2; Resnik & Dewaele, 2021). On the other hand, students might also feel overwhelmed in more flexible out-of-class situations as they also entail greater responsibility (Daniels & Stupnisky, 2012, p. 224). Depending on the respective learning disposition, they might enjoy the social presence in class situations more than isolation at home (Muir et al., 2019, p. 264). These different preferences for either course or out-of-class contexts could balance each other out on average. Accordingly, some researchers argue that similarities between contexts might outweigh the differences such that the attributes of course instruction spill over to out-of-class learning (e.g., by means of assignments or course materials studied outside the class, teaching styles and expectations). In other words, students might perceive learning activities in and outside the course to be a part of the same overall learning setup (Goetz et al., 2012, p. 226; Ranellucci et al., 2021, p. 3). In a review of four studies which had compared the differences between traditional and online classrooms, Daniels and Stupnisky (2012) found that the roles of CV appraisals on achievement emotions did not vary across learning contexts. Dettmers et al. (2011) found that unpleasant emotions during out-of-class assignments predicted achievement in a similar way as course emotions.
They assumed that despite the differences in learning environment characteristics, the relationship between learning and course emotions, CV appraisals as well as achievement may be rather stable when both contexts are considered supportive of control and value to a similar extent. Similarly, Ranellucci et al. (2021) conducted latent correlational analyses on a sample of 269 undergraduate students from an anatomy course and found similar patterns of relationships irrespective of the context. Apart from the attested similarities, some research also found differentially functioning patterns regarding the emotional context. Under the umbrella of homework emotions, Goetz et al. (2012) were among the few researchers empirically addressing the out-of-class context of achievement emotions in secondary school. By means of latent correlational analyses, they found that self-efficacy was slightly more strongly related to class-related emotions than to learning-related emotions. Peixoto et al. (2017) corroborated these findings using the same
data analysis method. Due to the middle school samples, however, the findings are barely transferable to the present large lecture study because in-class instruction in the other studies was characterized by a higher degree of teacher-student interaction to assist course-related emotion regulation (Peixoto et al., 2017, p. 399). In an experimental design with 43 students of a graduate information systems course, paired t-tests suggested higher satisfaction with working collaboratively in face-to-face (f2f) meetings compared to asynchronous group interaction via computer conferencing (Ocker & Yaverbaum, 1999). Resnik and Dewaele aptly summarize their findings along the same lines, suggesting that “disembodied classes have less emotional resonance” (2021, p. 1). Specifically, they analyzed the impact of a COVID-19-induced emergency remote teaching context on the enjoyment of English foreign language classes among 500 tertiary-level students. The results of the paired t-tests indicated that students enjoyed the online classes less than the in-person classes. In short, the state of research on learning- and course-related emotions is sparse and inconsistent. Some studies suggest that course emotions tend to be more relevant in the CV theory framework. Relating to smaller school classes or online classes conducted in the out-of-class context, most of them are, however, only transferable to a limited extent to the present study. The hypotheses therefore tentatively assume that learning- and course-related emotions function similarly in their relationship with academic achievement.
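
The reciprocal prediction assumed for this section can be made concrete with a cross-lagged specification of the kind used in the studies cited above (e.g., Pekrun et al., 2017). The following is only a minimal sketch and not necessarily the model estimated in the present study. The symbols are assumptions introduced for illustration: A denotes formative achievement, E denotes one achievement emotion (enjoyment or hopelessness, in either the course-related or the learning-related context), and t indexes the measurement occasions within the semester.

```latex
% Hypothetical cross-lagged sketch; all symbols are illustrative assumptions.
\begin{align}
E_{t} &= \alpha_{E}\, E_{t-1} + \gamma\, A_{t-1} + \varepsilon_{E,t} \\
A_{t} &= \alpha_{A}\, A_{t-1} + \beta\, E_{t-1} + \varepsilon_{A,t}
\end{align}
% Reciprocal positive prediction (enjoyment):    \gamma > 0 and \beta > 0
% Reciprocal negative prediction (hopelessness): \gamma < 0 and \beta < 0
```

In this notation, the autoregressive paths (alpha) capture the stability of each construct, while the cross-lagged paths (gamma and beta) carry the reciprocal relation. A reciprocal relation in the sense used here requires both cross-lagged paths to differ from zero in the expected direction.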

H6: Formative achievement and (a) course- and (b) learning-related enjoyment reciprocally positively predict each other throughout the semester.

H7: Formative achievement and (a) course- and (b) learning-related hopelessness reciprocally negatively predict each other throughout the semester.

3.4 Multiplicative Effects of Expectancy-Value Appraisals on Achievement Emotions and Performance

Having outlined the relationships between feedback and AME appraisals, a final inter-construct relationship that needs to be clarified is that between expectancy
and value. More precisely, the EV theory, or expectancy × value theory, fundamentally and historically assumes an interactive rather than an additive relation between both constructs in their impact on subsequent achievement emotions and, eventually, achievement (Eccles & Wigfield, 2020, p. 6; Pekrun, 2006, p. 320; Perez et al., 2019, p. 29). This multiplicative effect assumes that cognitive functioning and perceived value intertwine in such a way that motivation is high only if self-efficacy and value appraisals are high as well (Harackiewicz, Smith, et al., 2016, p. 221; Nagengast et al., 2011, p. 1058). For instance, when students appreciate a task in terms of its significance and perceive it to be congruent with their capabilities, EV appraisals should make the greatest impact on the dependent constructs (Acee & Weinstein, 2010, p. 491). The performance-enhancing effect of self-efficacy is thus expected to be stronger for students attaching greater value to the academic tasks (Trautwein et al., 2012, p. 763). The constellation can also be less favorable for a student who feels capable of completing a task (high self-efficacy) but does not value it sufficiently (low task value) and therefore might decide to disengage from the task (Cole, 2008, p. 613). The disruptive effect equally applies to a student who sees the personal importance of an upcoming task but feels unable to cope with it, which might trigger negative emotions or maladaptive achievement behavior (Jacob et al., 2019, p. 1769). Despite the fact that the multiplicative assumptions have been a part of the EV theory since its early years, these effects have not been much researched (Guo et al., 2015, p. 162; Perez et al., 2019, p. 29), leading Nagengast et al. to ask “who took the ‘×’ out” of the EV theory (2011, p. 1058). In the last two decades, research was therefore again devoted to the investigation of the intended multiplicative effect.
Few studies on university students tentatively confirmed the existence of performance-enhancing EV interaction effects. For instance, Trautwein et al. (2012) as well as Meyer et al. (2019), using latent moderated SEM on large datasets of upper secondary students, found that the relationship between different types of subjective task value¹⁵ and achievement was more positive at higher levels of self-efficacy. As regards achievement emotions, a multiplicative EV effect impacted enjoyment in a university statistics course, i.e., when both appraisals were high, enjoyment was also high, and vice versa (Pekrun et al., 2002). Berweger et al. (2022) conducted another study with 95 university students in an online learning environment with automated feedback including less complex and problem-based tasks.

¹⁵ Other researchers also considered interactive effects of expectancies contingent on effort costs (as part of subjective task value) and found that the impact of expectancies on achievement was more pronounced for students with a low effort cost perception (Perez et al., 2019, p. 29).

A hierarchical design using multilevel analyses revealed a significant moderation of expectancy on hopelessness at high levels of cost value (Berweger et al., 2022, p. 7). However, the authors assessed each emotional state by means of one manifest item only. Shao et al. (2020) as well as Wu and Kang (2021) found significant interactions of CV appraisals in predicting enjoyment, hopelessness, and achievement in foreign language learning by means of latent interaction analyses. Both samples, however, consist predominantly of female students (which is typical of foreign language courses), and the second study only considered attainment value instead of utility and interest value. The majority of other recent studies stem from large samples of middle school students and attested interaction effects of EV appraisals on achievement, engagement, course selection, and career aspirations, mostly using latent interaction SEM (Fong et al., 2018; Guo et al., 2017; Nagengast et al., 2011; Xu, 2020; no significant interaction on achievement was found in Song & Chung, 2020). In all, the above results suggest that EV interaction effects may occur in predicting both students’ achievement emotions and academic achievement. This suggests that it is appropriate to check the data for combined effects of expectancy and value appraisals. The above-cited findings from middle school datasets, however, lack transferability to the higher education context, and the interaction effects in EV theory observational studies were typically rather small in size (Trautwein et al., 2012, p. 3; Meyer et al., 2019, p. 59). Hence, in the later analyses, interaction effects will be checked, but no specific hypotheses will be formulated due to the uncertainty of their occurrence.
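
Checking for an interaction of this kind amounts to adding a product term to an otherwise additive model. The following is a minimal sketch in regression notation. It assumes centered manifest scores for expectancy (S, e.g., self-efficacy) and value (V) and an outcome Y (an achievement emotion or achievement). These symbols are illustrative and not taken from the original, and the latent interaction models cited above estimate the corresponding term on latent rather than manifest variables.

```latex
% Illustrative moderated regression; symbols S, V, Y and b_0..b_3 are assumptions.
\begin{align}
Y &= b_0 + b_1 S + b_2 V + b_3\,(S \times V) + \varepsilon \\
\frac{\partial Y}{\partial S} &= b_1 + b_3 V,
\qquad
\frac{\partial Y}{\partial V} = b_2 + b_3 S
\end{align}
```

With b_3 = 0 the model is purely additive. With b_3 > 0 the effect of value on the outcome becomes more positive at higher levels of expectancy (and the other way round), which is the pattern reported for achievement by Trautwein et al. (2012) and Meyer et al. (2019).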

3.5 Average Development of Achievement Motivation and Emotion throughout a Semester

Moving from the internal appraisal-performance relationships to the exterior recursive feedback loop, it is also relevant to know to what extent the investigated constructs were found to change on average throughout a semester without any additional instructional influencing factors. Within the context of statistics, it also has to be considered that AME appraisals at the beginning of the semester are likely to be inaccurate since it is a completely new subject domain for most students (Niculescu et al., 2016, p. 290). Schau (2003) assumes that mean attitude scores are likely to decrease when students start the statistics course with relatively high attitudes because they have no clear conception of the subject of statistics when transitioning from high school to university. Gal and Ginsburg add that, due
to the “fuzziness of the term ‘statistics’” (1994), students might not know what to expect in the course, so that they base their judgements on vague, unrealistic assumptions regarding the content and on past experiences (such as from the field of mathematics). Hence, it could be assumed that there is at least some variation in the appraisals throughout the semester because students develop a more realistic and differentiated view of the subject. Concerning mean changes in statistics-related EV appraisals, empirical findings are inconsistent. Few studies documented improvements on all attitude scales (Chiesi & Primi, 2010; Waters et al., 1988). Some studies found improvements in some components, such that statistics-related self-efficacy improved throughout the semester (Burnham & Blankenship, 2020; Finney & Schraw, 2003; Harpe et al., 2012; Kiekkas et al., 2015; Milic et al., 2016), anxiety decreased (Showalter, 2021; Milic et al., 2016; Kerby & Wroughton, 2017; Kiekkas et al., 2015), and difficulty decreased (Harpe et al., 2012). Interest, value, and effort were occasionally found to minimally decrease in the context of undergraduate statistics (Burnham & Blankenship, 2020; Gundlach et al., 2015; Kerby & Wroughton, 2017; Milic et al., 2016; Schau & Emmioglu, 2012) and secondary school mathematics courses (Gaspard et al., 2015; Wigfield & Cambria, 2010), while fewer studies on statistics documented an increase (e.g., Kiekkas et al., 2015). Schau and Emmioglu (2012, p. 92) as well as Emmioglu et al. (2018) reported the average effort appraisal to be very high at the beginning while slightly decreasing throughout the semester. The above-mentioned positive EV changes mostly occurred in smaller samples (approx. 100–150 undergraduates) while negative changes were found in studies with large-scale data (approx. 200–2,000 undergraduates; Schau, 2003; Schau & Emmioglu, 2012).
These negative changes were only minor because students started with neutral or above-average mean values. Other studies also documented that statistics EV appraisals, such as self-efficacy, difficulty and affect, were recalcitrant to change (Schau & Emmioglu, 2012). As regards achievement emotions, research on average change throughout a semester has been far less extensive than for attitudes towards statistics. The few existing studies moreover focus on middle school students (grades 5–9). Starkey et al. (2018, p. 454), Ahmed et al. (2013) and Pekrun et al. (2017) found that enjoyment and hopelessness significantly decreased throughout middle school. In a meta-analysis, Camacho-Morles et al. (2021) found that the emotion-performance association was toned down for university students compared to secondary students. Using latent change models on a large sample of first-year students, Niculescu et al. (2016) found that learning-related enjoyment and hopelessness remained stable throughout the
semester. The stability was confirmed for a large sample of psychology students (Tze et al., 2021). In the scope of latent transition analysis, the researchers found that the majority of students remained in a stable positive emotion profile over the one-year course. In sum, as regards EV appraisals, the inconsistent findings likely stem from the variety of different samples (i.e., course size, course conditions and complexity, type of educational institution, country) and the different attitude measures used (Xu, 2020, p. 2). The evidence could also be biased due to the fact that most of these studies only tested for mean differences without testing factorial equivalence across time. Regardless of the direction, the majority of studies found that EV appraisals towards statistics changed moderately, so that it can be assumed that they are at least marginally responsive to instructional influencing factors, such as feedback. By contrast, the findings regarding achievement emotions suggest an amortization effect according to the notion of developmental equilibrium such that emotional appraisals become less expressive and volatile as students grow up (Arens et al., 2022, p. 618). School students are assumed to feel their emotions more intensively because they are by nature more susceptible to fascination for the outside world, to which adults are likely already habituated with a tendency to conceal their emotions (Camacho-Morles et al., 2021; Starkey et al., 2018, p. 460). Another related reason for toned-down emotions and the intrinsic devaluation of academic tasks lies in the increasing academic demands, evaluative pressure and the growing goal conflicts between academic and personal interests, which lead to a deprioritization of personally irrelevant subjects (Pekrun, 2017; Wigfield & Cambria, 2010, p. 16). 
Finally, as opposed to the documented moderate fluctuations for EV appraisals, the stability and smaller variance in achievement emotions over time might also attenuate the responsiveness to potentially influencing instructional measures and thus limit the feedback effect in the context of the present study.

3.6 Intertwining the Expectancy- and Control-Value Theory to an Integrative Model of Motivational and Emotional Feedback Processing

The two theoretical approaches of EV and CV theory served as a rationale for the hypothesized relationships between achievement and AME appraisals, respectively. After having outlined the internal relationships between these constructs, the hypotheses will be integrated in an overarching theoretical framework which unifies the commonalities of both approaches. The common denominator of both
theories is the assumption that individual choices, persistence, and achievement-related behavior are reciprocally interrelated with a multiplicative conjunction of expectancy/control and value appraisals. On the one hand, EV theory places a stronger focus on the effect mechanisms between EV appraisals and achievement contingent on prior achievement and the individual cultural surroundings (e.g., family environment, gender stereotypes). On the other hand, CV theory integrates emotions as consequences of EV appraisals and the learning environment along with instructional characteristics (i.e., provision of feedback) as necessary influencing factors. That way, both approaches can be integrated synergistically with each other, leading to the following tentative, theoretical model along with the above-generated hypotheses in Figure 3.8.

[Figure 3.8: path diagram. Elements shown: Formative achievement (at occasions t-1 and t); Control (self-efficacy, difficulty); Value (utility value, interest value); Affect; Effort; Achievement emotions (enjoyment, hopelessness; in-class and out-of-class); Summative achievement. Paths are labeled H1, (H2), H3, H4, H5, H6, and (H7), with feedback loops across time.]

Figure 3.8 Analytical Adaptation of the FRAME model with Corresponding Hypotheses. (Notes. Source: Author’s own; Hypotheses in brackets refer to a negatively assumed relationship)

In this model, formative achievement at the occasion bt-1 refers to the prior achievement-related experiences according to the EV model (see Figure 3.3) and feedback provision according to the learning environment factors of the CV theory (see Figure 3.6). While CV appraisals were already represented in both separate approaches, achievement emotions were added in the integrated model as a consequence of CV appraisals according to the CV theory. The positioning of
the effort construct has to be elaborated in greater detail. Analogous to the elaborations in section 3.2.9 and according to the CV theory, effort in the present study is conceptualized as readiness to perform and learn, acting as go-between of EV appraisals and achievement. Therefore, the effort construct is moved from the subjective task value section, which was the conceptualization of the EV model, to the position closest to formative achievement. The hypotheses drawn on the arrows in Figure 3.8 are aligned to the appraisal-performance relationships outlined in the previous chapters. Each hypothesis occurs twice in order to reflect that the reciprocity of these relations will be investigated. The above-depicted theoretical model is context-independent. Knowledge acquisition in the domain of statistics education has, however, been shown to be dependent on several heterogeneity criteria, i.e., gender, prior knowledge (Ramirez et al., 2012) and course design. Therefore, in a final step, the theoretical model will be related to these contextual criteria to investigate the generalizability of feedback-related learning processes and potentially differential effects in the assumed appraisal-performance relationships. This contextual framework also helps to determine under which conditions and for which groups of students the benefit of receiving feedback is highest.

4 Further Contextualization of Motivational and Emotional Feedback Processing

4.1 Variables and Contexts Considered Relevant for Feedback Processing in Statistics Education

In the research tradition of statistics education, gender and prior knowledge have been found to generate heterogeneity in students’ EV appraisals in mathematical and statistical domains with small to moderate effect sizes (Macher et al., 2013; Ramirez et al., 2012). Due to their relevance, these student characteristics have been implemented in the EV-based SATS model (SATS-M) as major influencing factors for motivational and emotional development to account for students learning statistics (Ramirez et al., 2012). Therefore, both sociodemographic variables will be factored into the present theoretical model, assuming that they will most likely introduce some variation into the hypothesized relationships. The importance of contextual influences on EV appraisals had already been pointed out in earlier works of Wigfield and Eccles (2002, p. 114). Consideration of the classroom context is important to determine optimal challenges for motivating students and to assess the efficacy of various teaching approaches (Wigfield & Eccles, 2002, p. 114). Despite this constant reminder, not much is known about the efficacy of computer-mediated feedback and its influence on AME appraisals in different educational contexts. To expand the trans-contextual scope, the appraisal-feedback relationships will therefore be investigated depending on two different course designs. The inclusion of the course design renders the feedback framework more interactional in such a way that the feedback source (i.e., the instructional medium) and the feedback receivers (i.e., students’ internal processes) are accounted for (Panadero & Lipnevich, 2022, p. 9). In order to investigate the reciprocal feedback processes, each chapter starts with a theoretical and empirical consideration of differences

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_4


within the AME appraisals in the context of statistics education according to the influencing factors. From these variations, potentially differential mechanisms in feedback reception will be inferred in a second step. In the following, gender, prior knowledge, and course design are processed in the order of their mutability and according to their binary manifestation (i.e., male/female, low achieving/high achieving, traditional/flipped).

4.2 The Prevalent Gender Differential in Statistics-related Motivational and Emotional Appraisals

4.2.1 Gender-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research

Based on the assumptions of emergence and interdependence of AME appraisals on formative feedback, the theoretical model will be further elaborated in order to account for potentially moderating influences which are considered relevant in the scope of statistics education. Math and statistics are stereotyped and male-dominated domains (Arens, 2020, p. 619). Therefore, a core influence of gender differences lies in societal beliefs that already begin to play a major role during parental upbringing. Parents or other socializers may misjudge girls’ inclinations to mathematics-related domains, portray these domains as typically masculine, or otherwise fail to convey appreciation for these subjects (Harackiewicz et al., 2012, p. 902; Marshman et al., 2018; Pajares, 2002, p. 119). Such a socially imposed stereotype may increase the fear of female students of confirming an alleged inferiority, which has been shown to induce maladaptive behaviors (Autin & Croizet, 2012, p. 615). In that way, the stereotype may become a self-fulfilling prophecy resulting in actual performance differences, also referred to as stereotype threat (Davies & Spencer, 2005; Doyle & Voyer, 2016; Franceschini et al., 2014). Empirical findings, fortunately enough, provide no strong support for performance-related differences to the detriment of female students (Lauermann et al., 2017, p. 1541). When achievement differences were documented, they were usually weak (Lauermann et al., 2017, p. 1543). For instance, only few studies documented significant differences favoring males (e.g., Tempelaar et al., 2006). In a meta-analysis of 13 studies, Schram (1996) reported an average effect size of d = .08 in favor of female students in statistics achievement. In most other empirical studies, gender-related differences in performance were found to be
insignificant (Bechrakis et al., 2010; Bradley & Wygant, 1998; Kiekkas et al., 2015; Macher et al., 2013). While the findings on performance differences are rather fuzzy, a clearer picture emerges from the consideration of gender-related differences in EV appraisals. As Frenzel, Pekrun, et al. put it for school mathematics, the small performance differences are opposed to consistent differences in the affective domain (2007, p. 498), which also applies to the context of statistics education. Assuming that mathematics-related fields are more appropriate for males, the stereotype threat could also imply lower expectancies for success on the part of female students, for instance, when they are believed to be less competent in these domains, and when male accomplishments in that domain are overemphasized (Bandura, 1997). As Emmioglu et al. put it, a male dominance might create a threatening environment for female students causing them to underestimate their own capabilities and to have more negative AME appraisals (2019, p. 126). The above-mentioned role of stereotyped socializers in the upbringing could also mislead women into losing interest in STEM-related careers (Harackiewicz et al., 2012, p. 901; Rosenzweig & Wigfield, 2019, p. 152). Accordingly, gender-related differences have been found to manifest themselves empirically in EV appraisals when passing secondary school mathematics and higher education statistics courses (Ramirez et al., 2012). In accordance with the stereotype assumption, starting with secondary school findings, studies found that particularly female students have lower expectancies and higher anxiety levels in secondary school mathematics (Goetz et al., 2013; OECD, 2015). Moreover, female secondary school students have been found to attach lower interest and utility value to mathematics compared to male students, thus perceiving mathematics as a rather unappealing subject (Frenzel, Pekrun, et al., 2007; Gaspard et al., 2015). 
The most consistent finding in statistics education research is that male students report significantly higher levels of expectancy appraisals and statistics anxiety (Bechrakis et al., 2011; Emmioglu et al., 2019; Förster & Maur, 2015; Hommik & Luik, 2017; Opstad, 2020; Tempelaar & van der Loeff, 2011). Fewer studies documented insignificant gender differences (Cashin & Elmore, 2005; Coetzee & van der Merwe, 2010; Macher et al., 2013). These studies are, however, based on smaller samples (approx. 100–200 students), while those in favor of males partly draw on large datasets with more than 1,000 introductory students. No study was found that documented differences in the expectancies for success in favor of female students. The picture is less consistent as regards value appraisals (i.e., interest and utility value). For utility value, there is an equal number of studies that found differences in favor of male students (Hannigan et al.,
2014; Hommik & Luik, 2017; Opstad, 2020; Tempelaar & van der Loeff, 2011) and studies that found no significant differences (Bechrakis et al., 2010; Coetzee & van der Merwe, 2010; Schau, 2003; Tempelaar et al., 2006). A minority of studies found utility value related differences in favor of female students (e.g., Rodarte-Luna & Sherry, 2008). For statistics-related interest value, by contrast, the majority of studies again points to male students reporting higher manifestations (Hannigan et al., 2014; Hommik & Luik, 2017; Opstad, 2020; Soe et al., 2021; Tempelaar & van der Loeff, 2011). Only one study was found that documented higher mean interest for female students (Tempelaar et al., 2006). There is an inverse pattern for the final remaining construct in the SATS model; the willingness to invest effort was unanimously found to be higher on the part of female students (Hommik & Luik, 2017; Hannigan et al., 2014; Opstad, 2020; Tempelaar & van der Loeff, 2011). In sum, there is a relatively clear picture for expectancies for success, statistics anxiety, and interest (higher for males), and effort (higher for females). Utility value for introductory university students was less consistent across genders. Taking the above-mentioned differences from secondary school into account, there is, however, also a tendency to the benefit of male students. The higher perceived effort could be due to the overemphasis of male accomplishments, leading females to assume that they must make extra efforts to succeed. Effect sizes for the relationship between gender and EV appraisals were usually small or, occasionally, moderate. Considering that there is no consistent performance differential despite the more pronounced attitudinal differences, it could be assumed that potentially adverse effects of female students’ negative attitudes are outweighed by efficient coping mechanisms, learning strategies, and resource management (Macher et al., 2012, p. 495; Rodarte-Luna & Sherry, 2008, p. 335). 
Compared to the EV appraisals, gender-related differences in achievement emotions are far less researched and rather ambivalent (Conrad, 2020, p. 47). Analogous to CV theory, the gender-related patterns of CV appraisals, mostly to the detriment of female students, can be expected to translate into debilitating emotional manifestations (Frenzel, Pekrun, et al., 2007, p. 500; Loderer et al., 2020). From the underlying CV relationships, it is tentatively assumed that female students experience less enjoyment and more hopelessness in statistics-related coursework (i.e., due to their above-mentioned, lower CV appraisals). Emotional differences may vary across domains. For instance, female students were generally reported to experience more enjoyment than males in language-related courses such as English (Ismail, 2015). Therefore, the following empirical findings will always refer to the respective domain context of the study.


In line with gender differences in test anxiety, females enrolled in different university courses experienced greater average hopelessness while enjoyment often did not significantly differ across gender (Pekrun et al., 2004, p. 300). Tempelaar et al. (2017) and Starkey-Perret et al. (2018) collected data from approx. 1,000 students of a blended introductory quantitative course, and 300 middle school students, respectively. Both studies found significant gender differences in hopelessness to the detriment of female students, but not for enjoyment. Conversely, Curelaru and Diac (2022, p. 58) collected data from 365 first-year students of different courses (e.g., natural sciences and arts and humanities), where female students reported significantly higher enjoyment and differences remained insignificant for hopelessness. Similarly, another study from Pekrun et al. (2011, p. 42) with undergraduate psychology students showed that females experienced both course enjoyment and learning-related anxiety more intensely than their male counterparts. Harley et al. (2020) found that female students at a faculty of education are more likely to be clustered in a negative emotional profile (including hopelessness) compared to male students. Riegel and Evans (2021) investigated the impact of frequent online assessment on achievement emotions with a sample of 400 undergraduates in a service mathematics course and found no significant gender-related average differences for either enjoyment or hopelessness around the assessments. Moving from the university to the secondary school context, Burić (2015, p. 792) found that female middle school students experience stronger intensities of hopelessness compared to male students in mathematics (enjoyment was not assessed in the study). Frenzel, Pekrun, et al. 
(2007) assessed mathematics-related enjoyment and hopelessness of more than 2,000 5th grade students and found that girls reported less enjoyment, more anxiety, and more hopelessness compared to male students despite similar mathematics grades. The authors also conducted multi-group analyses that showed that the structural relationships between achievement emotions and a midterm grade were mostly invariant across gender despite the documented average differences. This finding is, however, only transferable to a limited extent to the present study due to the sample of 5th graders and the summative operationalization of prior achievement. In all, while gender-related differences in enjoyment appraisals are rather inconsistent, hopelessness was mostly consistently reported to be higher for female students. Female students are seemingly more prone to negative emotions compared to male students. These negative manifestations could be ascribed to additional struggles that females undergo in originally male-dominated fields (Marshman et al., 2018). These different manifestations might indicate that women also respond differently
to instructional measures, such as feedback in relation to tasks in the statistics domain, for which women are assumed to be affected by societal beliefs.

4.2.2 Gender-related Differences in Feedback Processing in Theoretical and Empirical Research

Applying the gender-related findings on the reception of feedback, it has to be considered that female students tend to have on average lower statistics-related EV appraisals and a more negative emotional attitude compared to male students. These negative manifestations could potentially detract from the degree to which feedback is suited to facilitate gains in AME appraisals. This suggests that female students, even upon receiving predominantly positive feedback, tend to process it in a more skeptical, less confident, and reserved way compared to male students. Another reason for a gender-related differential in the internalization of feedback could lie in the female attribution bias (Marshman et al., 2018). Stipek and Gralinski (1991) found empirical evidence that female middle school students tended to attribute failure to low ability, while success was not attributed to high ability. This incongruity particularly applies to masculine-dominated tasks (Schunk & Lilly, 1984, p. 204). Assuming that female students view themselves as less competent in statistics, unforeseen successes are likely not attributed to inherent ability, but to luck, for instance (Marshman et al., 2018). Pajares (2002, p. 118) refers to these different mindsets as more over-confident, self-congratulatory for male students whereas females are more modest and pessimistic in their evaluations. Relating to the stereotype threat, Korman (1970) formulated a self-consistency theory that accounts for self-fulfilling prophecies of societal stereotypes. According to this theory, individuals approaching a task with low expectancies are more likely to reject the feedback or process it more unfavorably, even when they succeeded in performing the task, in order to adhere to the expectations that the cultural stereotype would predict. 
The assumption that female students should struggle more in statistics and are less expected to aspire for statistics-related careers may thus attenuate the EV-enhancing quiz feedback effect for female students in order to remain consistent with the stereotype. The more unfavorable motivational appraisals of feedback may further promote pessimistic emotional processing on the part of women, as was documented in the studies above. Regarding the effort construct, i.e., the most proximal achievement-related behavior, female students consistently reported higher levels compared to male students (see above). Accordingly, in some statistics courses, female students
were found to invest more personal resources, such as time and effort, in their studies and were more likely to complete bonus exercises (Fischer et al., 2013; Macher et al., 2012; Ramirez et al., 2012; Tempelaar & van der Loeff, 2011). From this follows the assumption that female students may rather draw on supplemental offerings as they believe they have knowledge gaps, resulting in a permanent desire to clear backlogs in learning. By analyzing concrete participation rates in a quasi-experimental study, one study also documented that female students participate more frequently in voluntary quizzes, but profit less in terms of knowledge gains than male students (Förster et al., 2018). Due to the stronger reliance on diligence and effortful processing of tasks, female students might believe that their success is not a reflection of their ability, but of their mere effort and therefore gain less expectancy of success from their achievements compared to students who succeed in the blink of an eye (Schunk, 1991). In all, it could be expected that female students profit less from the EV-enhancing feedback effect while also being tempted to process negative feedback more negatively compared to male students (Marshman et al., 2018; Stipek & Gralinski, 1991, p. 367). By contrast, the relationship between effort and formative achievement is expected to be stronger for female students. There are only few empirical studies on gender-related motivational processing of feedback during knowledge acquisition. Nevertheless, taking a look at some of these studies may help to further sharpen the above assumptions. Malespina and Singh (2022) investigated the differential gender impact on university students’ outcomes in high-stake and low-stake assessments. By means of mediation models, they found a significant performance difference on high-stake performance to the detriment of female students, which could be ascribed to gender differences resonating in self-efficacy and test anxiety. 
For the low-stakes quizzes, gender was an insignificant predictor, so that there was no motivational gender difference to account for (Malespina & Singh, 2022, p. 8). This finding suggests that the moderating effect of gender on the relationship between appraisals and performance may be less relevant for low-stakes assessments. Some researchers also investigated the moderating effect of gender on the efficacy-performance relationship. Despite the consistently reported average gender differences in self-efficacy, most studies found EV-performance linkages in mathematics to be gender-invariant (meta-analytically: Valentine et al., 2004; for middle school: Marsh et al., 2005; for primary school: Seegers & Boekaerts, 1996). Randhawa et al. conducted structural equation models on a sample of high school seniors to model mathematics achievement and found that the path coefficients of efficacy to achievement were stronger for boys (1993, p. 47). Schunk and Lilly (1984) found that attributions of self-efficacy differed between male and female middle school students only
before solving a mathematical task, but not after targeted instruction and reception of performance feedback. As regards interest value, Krapp (1995, p. 754) points to a moderating effect of gender on the relationship between interest and academic performance across school types and grades, whereby higher correlations between interest and academic performance are generally observed for male learners than for female learners. Taking everything into consideration, theory suggests that male and female students may process information differently on grounds of cultural stereotypes, which also manifests itself empirically in the mostly lower AME appraisals on the part of female students (except for effort). When considering empirical findings on feedback reception or on the appraisal-performance relationships, mostly no significant moderation according to gender was found. These studies, except for Valentine et al.’s meta-analysis (2004), however, were conducted on middle school or even primary school students and may therefore lack transferability to the higher education context.

4.3 The Shaping Role of Statistics-related Prior Knowledge in Motivational and Emotional Appraisals

4.3.1 Expertise-related Mean Differences in Motivational and Emotional Appraisals in Theoretical and Empirical Research

Students systematically evaluate their prior abilities by means of positive dimensional comparison processes with other contrasting or assimilating domains (Guo et al., 2017, p. 81). Prior mathematical or statistics-related knowledge has been found a significant predictor of subsequent cognitive and affective learning (Opstad, 2020; Ramirez et al., 2012). Although some researchers claim that domain-specific prior knowledge is a relevant determinant of feedback reception and its proper contextualization (Bangert-Drowns et al., 1991; Lipnevich et al., 2016, p. 180; Narciss & Huth, 2004), little empirical research has been devoted to the moderating influence of varying expertise levels (Li, 2010, p. 349). Research in statistics education has mostly shown that higher proficient students, i.e., students with quantitatively or qualitatively more experience in related mathematical or statistical domains, perform better in subsequent statistics courses (Cashin & Elmore, 2005; Chiesi & Primi, 2010; Macher et al., 2013; Onwuegbuzie, 2003; Wisenbaker, 1999). Since more experienced students are assumed


to perform better, and since achievement is assumed to positively correlate with subsequent AME appraisals, they might profit more from the feedback compared to lower achieving students. To shed light on these interdependencies and analogous to Section 4.2, expertise-related differential functioning is approximated in a first step via average differences in statistics EV appraisals. Dimensional comparison theory holds that students’ ability comparisons in two similar domains also impact the manifestations of intrinsic value. Hence, assuming that higher proficiency positively relates to value appraisals (H3) and given that students perceive both domains to be similar, intrinsic values attached to the mathematics domain would likely generalize to statistics as well (Guo et al., 2017, p. 89). A majority of studies operationalized statistics-related prior knowledge through the number of passed mathematics college courses (Macher et al., 2012; Ramirez et al., 2012), while other researchers used a competence-related test to assess students’ prior mathematics background (e.g., Chiesi & Primi, 2010) or a math major (Tempelaar & van der Loeff, 2011). Gal and Ginsburg (1994) argue that prior experiences with mathematics loom large when starting a statistics course because the subject is assumed to consist largely of mathematics content. This association is related to the fact that beginning students lack a proper preconception and base their anticipations on their math experience (Gal & Ginsburg, 1994). Most studies found positive associations with statistics-related EV appraisals. Similar to the gender-related mean differences, the findings are most consistent concerning self-efficacy and statistics anxiety.
Students with a higher prior knowledge have a higher self-efficacy and a lower anxiety compared to lower prior knowledge students (e.g., Carmona, 2005; Cashin & Elmore, 2005; Coetzee & van der Merwe, 2010; Macher et al., 2012; Onwuegbuzie, 2003; Sharma & Srivastav, 2021; Soe et al., 2021; Stone, 2006). Regarding statistics-related difficulty, fewer studies, but still the majority, point to a positive relationship with proficiency, i.e., students with more prior knowledge have a lower difficulty appraisal (Cashin & Elmore, 2005; Soe et al., 2021; Tempelaar & van der Loeff, 2011). Fewer studies found no significant interrelation (Carmona, 2005; Stone, 2006). Statistics-related utility value was also fairly consistently higher for more proficient students (Baloglu, 2003; Cashin & Elmore, 2005; Tempelaar & van der Loeff, 2011; Stone, 2006) with some insignificant findings (Carmona, 2005). Statistics-related interest and effort in the context of the SATS model have only been investigated in few studies with no significant expertise-related differences (Tempelaar & van der Loeff, 2011; Soe et al., 2021). Analogous to the gender-related differences, the expertise-related emotional response to feedback information is assumed to be shaped by the CV appraisals


(Peterson et al., 2015; Loderer et al., 2020). Based on the findings that low prior knowledge students have, by and large, weaker control and value appraisals, it is assumed that they also experience less enjoyment and more hopelessness compared to high proficient students (Parr et al., 2019). Empirical findings corroborate this derivation. To begin with, in a systematic review of 186 studies on achievement emotions in technology-based environments, Loderer et al. (2020) found that prior knowledge was positively correlated with enjoyment and negatively with anxiety. Other studies found that GPA was not a significant predictor for achievement emotions before an assessment (Daniels et al., 2008), but that more immediate factors were responsible for the shaping of emotional appraisals. Ahmed et al. (2013), Frenzel, Pekrun, et al. (2007), and Goetz et al. (2007) documented that higher-ability middle school students (i.e., 6th and 7th graders) experienced more enjoyment, while low-ability students experienced more hopelessness and anxiety around an assessment situation. In some other studies, however, achievement emotions were not related to high school grades (e.g., Daniels et al., 2008, p. 595, with a large sample from introductory psychology courses). From the average differences in AME appraisals, predominantly to the benefit of higher proficient students, the question arises whether they also process feedback in more favorable ways.

4.3.2 Expertise-related Differences in Feedback Processing in Theoretical and Empirical Research

As elaborated above, high proficient students are expected to have, on average, higher expectancies for success in completing a statistics-related task compared to low proficient students (Cai et al., 2018). In that regard, attribution theory serves to explain how high and low proficient students might approach novel tasks differently, resulting in differences in the effect of prior appraisals on subsequent achievement. High-proficient students are expected to approach novel tasks more optimistically because they feel competent in that area (Lipnevich et al., 2016, p. 180) and attribute success to their own higher ability and failure to a lack of effort rather than to a reflection of their true ability when receiving feedback (Black & Wiliam, 1998, p. 24; Hurley, 2006, p. 441; Peterson et al., 2015, p. 84). Moreover, high-proficient students were found to engage in formative assessments following a mastery approach in order to optimize their skills and diligently deal with learning opportunities (Azzi et al., 2015, p. 413). By contrast, less proficient, failure-prone students tend to attribute failure to low ability and may not appreciate tasks at which they are less experienced and potentially


performing poorly (Black & Wiliam, 1998, p. 24; Lipnevich et al., 2016, p. 180). Therefore, they may approach new tasks more reservedly with the intention to avoid failure rather than to embrace success (Hurley, 2006, p. 441; Lipnevich et al., 2016, p. 180; Peterson et al., 2015, p. 84; Zeidner et al., 1998, p. 178). Moreover, students with higher prior knowledge are expected to be more accurate in their self-assessment (Narciss, 2008, p. 130; Peterson et al., 2015, p. 85) because they can draw on a quantitatively and qualitatively greater knowledge base to pinpoint their actual level of knowledge. Hence, it can also be expected that higher proficient students’ prior expectancy appraisals correlate more strongly with subsequent achievement. By contrast, less proficient students might fail to appropriately assess their current skill level or tend to have an overconfident self-image (Zimmerman & Moylan, 2011, p. 308), at least at the beginning of a course or before entirely novel tasks. From this it follows that the effect of higher expectancy appraisals on quiz performance could be attenuated for low-proficient students because they fall short of their own expectations. Expertise-specific attributions also suggest different effects of feedback on subsequent expectancy appraisals. Since high proficient students tend to attribute success to their own ability, feedback of any type (i.e., positive or negative) is more welcome because it is seen as support to build up further confidence (Lipnevich et al., 2016, p. 180). Higher proficient students tend to regard mistakes as natural challenges and thus better recognize the learning opportunities contained in the feedback information (Adams et al., 2019, p. 319; Hurley, 2006, p. 441). For less proficient students, who are more likely to construe failure as a threat to their self-esteem, more or less negative feedback can, however, further intensify the low expectancies and result in academic paralysis (Cai et al., 2018).
A lack of prior success experiences might therefore render weaker students more resistant toward feedback (Lipnevich et al., 2016, p. 180), so that the expectancy-enhancing feedback effect might be toned down. Accordingly, lower proficient students were also found to be less susceptible to instructional interventions (Zimmerman & Moylan, 2011, p. 308). The above assumptions all suggest that high proficient students receive feedback of any kind more favorably, whereas expectancy-feedback relationships are toned down for lower proficient students. In a meta-analysis of 22 studies on the efficacy of written feedback in L2 acquisition, Kang and Han found that intermediate and advanced learners profited most while novices did not seem to benefit at all (2015, p. 9). Further impulses for the role of prior knowledge in feedback reception come from cognitive load theory (Paas et al., 2003; Shute, 2008), which is concerned with the efficient allocation of cognitive resources under limited working memory capacities (Sweller et al., 2011). Cognitive load theory postulates that the


provision of more detailed feedback frees up cognitive resources (Kelley & McLaughlin, 2009, p. 1701). Because novel tasks in new domains, such as statistics, tend to overwhelm the working memory (Paas et al., 2003), external and detailed support is more beneficial for novice learners than for learners who can already draw on their existing prior knowledge (Kalyuga et al., 2003). According to the empirically supported expertise reversal effect (Kalyuga & Sweller, 2004; Smits et al., 2008), instructional measures that are effective for beginners may therefore be ineffective or even debilitating for more experienced learners (Fyfe et al., 2012, p. 1105). From cognitive load theory, it can be inferred that feedback elaboration and arrangement should depend on learners’ proficiency (i.e., fluid resources, cognitive resources, reasoning ability; Kelley & McLaughlin, 2009, p. 1701). Based on this assumption, lower proficient students, who cannot draw on profound subject-related prior knowledge, profit more from “studying the solution” by means of elaborated feedback with explicit guidance because it points them more explicitly to their conceptual mistakes (Day et al., 2018; Fyfe et al., 2012, p. 1105; Mason & Bruning, 2001; Shute, 2008). By contrast, high proficient students are likely to profit more from simple knowledge-of-correct-results feedback because they can tap into their higher cognitive resources and prior knowledge to discern errors by themselves (Kang & Han, 2015, p. 2; Mason & Bruning, 2001). Hence, they can autonomously draw their own inferences from unelaborated feedback while simultaneously being challenged to pursue the information for learning purposes to self-generate feedback (Day et al., 2013, p. 919; Kelley & McLaughlin, 2009, p. 1701).
Concerning the context of the present study, since every student, regardless of proficiency level, receives the same type of corrective, unelaborated feedback, it could be assumed that lower proficient students benefit less from the motivation- and emotion-enhancing impact compared to higher proficient students. As Lipnevich and Panadero put it, learners have to understand the feedback first in order to be able to beneficially use and process the information (2021, p. 6). Accordingly, the degree to which students deliberate on a message depends on their ability to process the feedback (Acee & Weinstein, 2010, p. 489). If novice students lack such relevant knowledge and ability, feedback may be rendered futile in terms of cognitive, motivational, and emotional development. Eventually, low proficient students might not fully appreciate or access the benefit of the feedback messages and their implications (Adams et al., 2019, p. 318). In contrast to this, high proficient students, assumed to be more knowledgeable, feedback literate and skilled in


self-regulation, can use the meaning of their own results to further enlarge their knowledge and build up EV appraisals (Hammad et al., 2020, p. 1503). In accordance with CV theory, it is assumed that higher proficient students, based on their higher CV appraisals, also profit more from the emotion-enhancing feedback effect. Hence, if an assessment task is related to a negative motivational appraisal for low proficient students, and they are unable to control this incongruity, this might aggravate their emotional experience. Only one study could be found that investigated the interrelations between feedback and achievement emotions contingent on prior knowledge. Based on latent curve modeling with undergraduate general education students, Peterson et al. (2015) found that only high GPA students appreciated the provision of feedback in such a way that it elicited positive and reduced negative emotions (i.e., enjoyment, hopelessness; Peterson et al., 2015, p. 93). Even though these findings are in line with the above assumptions and support the reinforcing impact of higher proficiency, they have to be interpreted with caution because further empirical corroboration is lacking. Regarding the relationship between self-perceived effort and feedback, it is difficult to assume a specific moderation effect of prior knowledge. On the one hand, it could be assumed that being educated at higher levels reduces the need to mobilize a high amount of energy for novel tasks (Tempelaar & van der Loeff, 2011).¹ On the other hand, some studies have shown that less proficient students gain significantly more in terms of performance when they invest a comparable amount of effort in voluntary tasks or quizzes (Förster et al., 2015; Self, 2013). Hence, the relationship between proficiency and effort remains ambiguous and will therefore be examined exploratorily.
In all, the moderating impact of prior knowledge on feedback reception is under-researched empirically in statistics education and other domains. Based on theoretical approximations with few empirical substantiations, it is assumed that higher proficient students profit more from the reciprocal linkages between feedback and AME appraisals compared to less proficient students.

¹ In the present study, effort is construed as willingness to achieve in terms of invested study time. Had the effort construct instead been construed as a measure of persistence in the face of challenge, a positive association with prior knowledge would have been a more logical inference.


4.4 The Flipped Classroom as a Potential Catalyst of the Motivational and Emotional Uptake of Feedback

4.4.1 Defining Characteristics of a Flipped Classroom

The reciprocal causation of the postulated theoretical models provides different leverage points for treatments of AME appraisals (Fulton & Fulton, 2020; Pekrun & Linnenbrink-Garcia, 2012). By targeting any of the elements in the postulated recursive feedback model, the predictive relations are assumed to change on grounds of their reciprocal linkage with their antecedents over time. Changing the learning environment from a traditional (TC) to a flipped classroom (FC) should therefore affect the complete feedback mechanism. Blended learning, an approach to education that combines online educational materials with place-based TC methods, may be a promising approach to foster students’ achievement motivation and emotion as it tackles their basic needs for autonomy, relatedness, and competence (McKenzie et al., 2013; Murphy & Stewart, 2015). The FC is a type of blended learning in which the phases of knowledge transmission, application, and practice are swapped. Students first acquire the basic knowledge autonomously outside the course with the help of the learning materials provided, such as pre-recorded educational videos, interactive visualizations, selected texts, and tasks. Hence, learning activities involving the necessary understanding are moved outside the classroom to make room for more challenging tasks fostering deeper knowledge inside the classroom (Abeysekera & Dawson, 2015; Ranellucci et al., 2021). Based on the knowledge acquired off-campus, the in-class focus is on the active and collaborative processing of complex tasks and their discussion with support from peers, tutors, and instructors. Following constructivist theories, the instructional design requires students to draw on and explore the knowledge they gained from the educational videos during student-centric, hands-on exercises and discussions in the f2f sessions with their peers (van Alten et al., 2019, p. 2).
In this acquisition process, students’ prior knowledge and intuitions are dynamically formed and re-formed, building relations with previous knowledge and conceptions, thus fostering deeper conceptual understanding through re-exposure (Banfield & Wilkerson, 2014, p. 291; van Alten et al., 2019, p. 2). Hence, they are enabled to reflect on their own understanding and test their skills under the guidance of their instructor. In the TC, the focus is on knowledge transfer, i.e., on the less demanding levels “remember” and “understand” (Anderson & Krathwohl, 2001), while the higher levels usually have to be mastered without support (Weidlich & Spannagel, 2014, p. 237).


In contrast, in the FC, the content (remember and understand) is transmitted in videos, while the more complex learning processes (apply, analyze, evaluate, and create) are addressed in the f2f sessions (Weidlich & Spannagel, 2014, p. 239). Considering the propositions of the environmental dimension of the CV framework (e.g., feedback, autonomy support, cognitive quality), it could therefore be assumed that flipped teaching aligns well with beneficial instructional practices that enhance AME appraisals.

4.4.2 Theoretical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms

In their encompassing review of feedback models, Panadero and Lipnevich conclude that the presentation of feedback is the least studied aspect (both theoretically and empirically) and has only recently received growing attention as a result of the increasing exploration of various instructional delivery modes (2022, p. 4). Even though quizzes are considered an integral part of the Inverted Classroom Mastery Model (Handke, 2020), they were included in a smaller share of research studies than other out-of-class activities (e.g., video lectures, homework, learning modules; Bishop & Verleger, 2012). The agentic use of feedback is thereby assumed to depend on the instructional contexts in which feedback is embedded (Panadero & Lipnevich, 2022, p. 6). Regarding the feedback reception in relation to the course design, even well-designed performance feedback might not unfold its full effect if transmitted in an unfavorable learning environment. To put it crudely, an inappropriate learning environment will attenuate potential EV-enhancing effects even of a cleverly designed task (Panadero & Lipnevich, 2022, p. 14). Hence, the environment in which the feedback is delivered matters as well (Lipnevich et al., 2016). As a first step, the main benefits of FC versus TC designs for achievement motivation and emotion are discussed and subsequently related to the feedback reception process. According to the theoretical model elaborated in Section 4.5, the supply-use model emphasizes the importance of external parameters, such as the instructional setting, while the CV theory postulates that these conditions of the learning environment impact CV appraisals, which in turn are received emotionally and then translate into effort and performance. Using this process logic for FC and TC, a first tentative implication is that a variation in control and valence attached to certain features of the TC or FC induces positive emotionality.
Along the lines of SDT, there is a considerable amount of research documenting that students have higher levels of motivation and persistence when the classroom


climate promotes the basic psychological needs of competence, autonomy, and relatedness, as compared to more controlling and pressuring climates (Baker & Goodboy, 2019, p. 82; Deci & Ryan, 2016, p. 17; Griffin, 2016, p. 117; Pekrun & Linnenbrink-Garcia, 2012, p. 276). Autonomy reflects the degree of volitional control and freedom of choice that individuals have over their own learning. Competence refers to experiences of individual mastery and the belief in being able to achieve the desired goals. Relatedness entails experiences of recognition and connectedness, stemming from the need to interact with others² (Deci et al., 1996, p. 172; Deci & Ryan, 2016, p. 15; Chans & Castro, 2021). SDT thus serves as a rationale for predicting to what extent contextual and instructional factors address these three basic needs and thus might induce positive changes in students’ expectancies for success, intrinsic motivation, and emotional appraisals (Abeysekera & Dawson, 2015; Deci et al., 1996, p. 172). As regards autonomy, instructional measures that function as extrinsic inducement (e.g., mandatory or heavily incentivized exercises with deadlines) tend to be perceived as controlling and diminish feelings of autonomy along with intrinsic motivation (Baker & Goodboy, 2019, p. 81; Deci et al., 1996, p. 172). According to a study that ranked preferred ways of teaching, whole-class discussions, videoclips, multimedia use, and cooperative learning are favored, whereas listening to a lecture and drill-and-practice sessions are rather unpopular (Jang et al., 2016, p. 690). Accordingly, a teacher-centric lecture can be perceived as controlling in such a way that it emphasizes the intimidating student-teacher power structure, the persuasion of students to think in specific ways, and the depersonalization due to a one-size-fits-all knowledge transmission (Baker & Goodboy, 2019, p. 82).
Offering choice from additional learning materials in the FC, which can be received on a voluntary basis (i.e., educational videos, interactive simulations, authentic datasets), and giving students the role of active participants in the f2f sessions, by contrast, follows a more autonomy-supportive approach (Baker & Goodboy, 2019, p. 82; Bouwmeester et al., 2019, p. 119; Griffin, 2016, p. 117). The shift of a considerable amount of learning outside the classroom equally increases individualization because students can study and (re-)watch the videos while controlling the viewing pace and frequency, whereas the TC strongly depends on the instructor’s directives (Hew & Lo, 2018, p. 5; Ranellucci et al., 2021). Authentic learning material in the FC along with a more intensive student-student and teacher-student classroom discourse also facilitates the absorption of

² In what follows, the focus will be placed on competence and autonomy as these needs are most relevant for feedback reception in conjunction with the instructional medium, while student-student and teacher-student interactions were not explicitly assessed in the present study due to the large lecture context.


the respective task values to make students see how statistics can be used in the real world (Pekrun, 2006, p. 334). Discussions arising in the collaborative FC sessions seize on students’ preconceptions, give them a voice in generating and wrestling with ideas, communicate validation, challenge their beliefs, encourage them to reflect, and thus place less emphasis on “right” or “wrong” answers and more on the practical value of the lesson, which is also expected to alleviate the controlling climate (Ranellucci et al., 2021; Griffin, 2016, p. 123; Parr et al., 2019, p. 4). In educational contexts, it also has to be considered that autonomy generally tends to be undermined to a certain degree due to pressures in anticipation of high-stakes summative assessments. Another unfavorable aspect could be the more open structure of the FC, which in combination with high task demands or an obligation to be always prepared could diminish autonomy and, eventually, AME appraisals (Bouwmeester et al., 2019, p. 125; Pekrun & Linnenbrink-Garcia, 2012, p. 275). Apart from enabling greater autonomy, the educational videos implemented in the FC serve as visual stimuli to spark students’ momentary interest and positive emotionality by bringing up relevant real-world problems related to the respective statistical topic. Following the four-phase model of interest development (Harackiewicz, Smith, et al., 2016, p. 221), this situational interest and initial curiosity (phase 1) can be revisited in the f2f session to maintain the interest while relating it to the tasks to be solved (phase 2). Value connections that students self-generate in the subsequent group work and discussions are moreover assumed to be more profound as compared to external value information provided by the instructor (Harackiewicz, Smith, et al., 2016, p. 224).
If students remain receptive over a longer period of time, i.e., by means of the novelty, complexity, activity, and other visual stimuli and activities in the FC, situational interest and state enjoyment can eventually develop into more sustainable, emerging individual interest (phase 3) and enduring individual interest (phase 4) in the subject matter (Hulleman et al., 2010, p. 2). Apart from autonomy, the need for competence is another basic desire that is addressed by means of an FC approach. Due to the reduced instructor-centered input compared to a TC, FC students have to show initiative and be active in order to keep on track with the course. In that way, the FC provides challenging and cognitively activating opportunities for students to take charge of, demonstrate, and check their understanding, which is assumed to promote the need for competence and the ownership of one’s own learning (Deci & Ryan, 1994, p. 8; Griffin, 2016, p. 117; Parr et al., 2019, p. 4). If they are thus more encouraged and stimulated to acquaint themselves with the course materials, this likely fosters expectancies for success by means of mastery experience, particularly at the beginning of the semester (Finney & Schraw, 2003, p. 182). Through the


enhanced cooperative learning in the f2f sessions, students in their groups are more likely to watch peers successfully complete tasks or receive mutual peer support and feedback (Finney & Schraw, 2003, p. 182). Several empirical studies have shown that group work positively impacts expectancy appraisals in terms of vicarious learning as students learn how to think and deliberate together (for an overview, see Hammad et al., 2020, p. 1504). It can also be assumed that corrective input from peers, tutors, instructors, or the computer-mediated feedback is more efficiently processed if it is based on prior knowledge from the preparatory stage. In other words, expectancies for success are likely to be fostered through reiterating the statistical concepts (i.e., in videos, f2f sessions, quizzes, and learning material). This also entails increased opportunities to process the content at higher cognitive levels, i.e., in the f2f sessions, and thus to receive more accurate feedback on specific misconceptions (Thai et al., 2017). Knowing that the FC likely promotes the needs for autonomy and competence, thus stimulating the CV mechanisms, the focus in the following will be narrowed down to the feedback reception in the FC. The provision of feedback most obviously targets the need for competence with the intention to foster perceived expectancies for success and intrinsic motivation (Deci et al., 1996, p. 177; Huang et al., 2019, p. 5; Jacob et al., 2019, p. 1769). Research, however, found that the efficacy of feedback also depends on its conveyance in an autonomy-supportive environment and in a non-controlling manner (Abeysekera & Dawson, 2015, p. 5). First of all, it could be assumed that students are more hesitant to receive feedback in strongly evaluative and obligatory contexts. Deci et al.
argued that the interpersonal context in which feedback is embedded affects students’ experience of self-determination and whether feedback is perceived as informational, controlling, or pressuring (2001, p. 4; Deci et al., 1999; Nichols & Dawson, 2012, p. 467). In a positive climate, such as an FC, an inherently constructive and explorative culture of error is an integral part of classroom dialogue and serves to clarify further learning opportunities according to a growth mindset (Deci et al., 1996, p. 173). In that sense, misconceptions are assumed to carry less weight in an FC as they are part of the learning process, which is expected to render students more receptive to feedback. Consequently, negative feedback could be expected to have a less detrimental effect on expectancies for success and value appraisals when it is provided in an autonomy-supportive way. This is due to the fact that even negative feedback in the FC can be perceived as a help to monitor one’s learning and to figure out how to master difficulties occurring in the learning process (Deci & Ryan, 1994, p. 8). Following on from this, autonomy allows students to self-regulate their learning processes and fosters their sense of competence, which in turn should


positively influence their achievement motivation and emotions (Pekrun, 2006, p. 335). It could be assumed that FC students are more reliant on formative feedback due to the withdrawal of immediate and explicit instruction that is associated with their greater autonomy. In order to satisfy their need for competence, they have to verify the appropriateness of their ongoing knowledge construction processes and to self-regulate their AME appraisals within the more discursive environment of the FC (Acee & Weinstein, 2010, p. 492; van Alten et al., 2019, p. 2). By contrast, in the TC, students might rather be attuned to their role as passive recipients without an immediate need for skill confirmation. External representations of task demand in the TC and FC could be another reason for individually different reception processes. According to Narciss (2008, p. 131), the basis of an external feedback loop lies in the disclosure of meaningful reference values regarding task requirements within the instructional medium. Since learning in the FC requires learners to actively obtain information about these external reference values in a series of mental transformations, they are likely more eager for and more receptive to external feedback to spot obstacles that undermine ongoing knowledge acquisition processes as early as possible in the learning process (Narciss, 2008, p. 132). The quiz feedback can thus be construed as an external requirement the learner needs to accurately acquire a new skill, which is even more important in the FC (Chuang et al., 2018, p. 66; Dehghan et al., 2020, p. 2513). Bandura and Locke (2003, p. 96) further argue that self-doubts about one’s success might provide an incentive to gain the necessary knowledge. In other words, students could be assumed to be more open to feedback from self-determined actions because they hunger for information to optimize their learning process (Banfield & Wilkerson, 2014, p. 291). Bangert-Drowns et al.
(1991) refer to this assumption in passing when explaining the concept of mindful learning. They assume that the need for feedback is diminished when the instruction is uncomplex, rote, or redundant and the feedback consequently receives less mindful attention (Bangert-Drowns et al., 1991, p. 218). Along these lines, an instructor-centered TC, where the teacher spoon-feeds the passive students with the relevant learning material, could be assumed to inhibit mindful feedback processing because internal negotiation processes are short-circuited. By contrast, the FC design requires learners to prepare actively for the f2f session by watching educational videos and completing pre-class assignments to which the sessions connect without instructor-centric repetition. In accordance with constructivism, confirming and disconfirming feedback on autonomously explored knowledge and developed skills is more likely to stimulate mindful processing. The effect is expected to be particularly strong if a self-assured state of mind is contradicted (Bangert-Drowns et al., 1991, p. 218).

Out-of-class activities, such as the educational videos and quizzes, thus resonate with a higher degree of intrinsic value because their preparation and completion increase the understanding of and participation in class-based group work and discussions (Cook & Babon, 2017, p. 3). Such an underlying motivation might help students to value quizzes and other out-of-class activities for more intrinsic motives. Engaging in the FC can, however, be perceived to be more effortful because it requires students to engage with pre-class and in-class activities to remain up to date with the syllabus (Cook & Babon, 2017, p. 12; Burgoyne & Eaton, 2018, p. 156). In this way, the FC could be an appropriate means to convert students’ extrinsic into intrinsic motivation to seek knowledge for its own sake. Assuming that the FC paves the way for deeper knowledge acquisition during the in-class sessions, a confirmation of the self-acquired, higher-order knowledge by means of (more or less) positive feedback likely leads to a more favorable reception of the feedback and might lead to higher increases in EV appraisals compared to the TC (van Alten et al., 2019, p. 3). According to Acee and Weinstein (2010, p. 489), increasing the instrumental relevance of a task also fosters its positive reappraisal and contributes to a more favorable processing. In the FC, preparatory watching of relevance-enhancing educational videos that offer examples of industry applications or other connections with the real world might facilitate a reappraisal of the task value, leading to a more meaningful scrutiny and task processing in relation to the previously acquired knowledge (Acee & Weinstein, 2010, p. 491; Baker & Goodboy, 2019, p. 82; Moozeh et al., 2019). Perceptions of quiz feedback as a strategic element for regulating learning behavior in the open structure could rub off on students’ instrumental motivation, i.e., utility value, to put the received feedback information to good use.
This might also make them more aware of the worth of knowing appropriate statistical concepts to determine solutions for the provided tasks (Chuang et al., 2018, p. 66). The theoretical rationale elaborated in this subchapter leads to the assumption that the FC strengthens the interrelation between formative feedback and CV appraisals, such that participants of the FC benefit more from the EV-enhancing feedback effects. As regards the impact of the FC on achievement emotions, the CV theory predicts that classroom instruction as a major factor of cognitive stimulation relates to students’ perceived control and value appraisals (Pekrun & Linnenbrink-Garcia, 2012, p. 275). Hence, the assumed stronger interrelations between feedback and CV appraisals in the FC likely translate into stronger associations with positive activity emotions, and eventually, higher engagement (Pekrun & Linnenbrink-Garcia, 2012, p. 276). Flow theory (Csikszentmihalyi, 2014) serves as a theoretical basis to explain differences in enjoyment in flipped courses.

Flow refers to an emotional state of enjoyment and intrinsic motivation that individuals experience when they are holistically immersed in a challenging activity while still feeling in control (Eccles & Wigfield, 2002, p. 113; Shi et al., 2014, p. 117). FCs thereby strongly encourage activity on the part of the learners and involve them in the learning process through clearly intertwined proximal goals, hands-on activities, and immediate feedback from peers, instructors, and quizzes, challenging them to autonomously extend their pre-existing skills (Huang et al., 2019, p. 6; Parr et al., 2019, p. 4; Shi et al., 2014, p. 117). A higher supply of different visual and cognitive stimuli in the FC may also help to foster students’ positive emotions and their ability to see relations among those stimuli (Budé et al., 2007, p. 16). In turn, a greater awareness of the relations between the different concepts might also enrich information processing and contribute to a better integrated knowledge network (Budé et al., 2007, p. 16). Positive emotions are also assumed to foster relational processing, i.e., the learning of coherent learning material, which facilitates the integration of feedback information and accurate retrieval of unpracticed concepts because they are closely interwoven in the knowledge network (Pekrun & Linnenbrink-Garcia, 2012, p. 265). Moreover, the FC is assumed to promote in-class cooperation, such as dialogic instruction, which has been shown to address the need for social relatedness and, in turn, to render the processing of tasks more enjoyable (Pekrun & Linnenbrink-Garcia, 2012, p. 276). Surprisingly, and as far as is known, the development of hopelessness or frustration has never been theorized systematically in the scope of the FC. Intuitively, no clear prognosis can be made. On the one hand, the FC could be a more stressful situation in that it requires more activity in-class and out-of-class.
On the other hand, the TC could lead to frustration in case of prolonged and unresolved confusion due to the lack of preparation, didactic reference frames, and in-class exchange with peers and the instructor. A stronger claim can, however, be made for out-of-class hopelessness, which is likely to be more relevant in the FC due to the relocation of learning activities to the preparatory stage, thus providing reference materials for students as a support for their autonomous learning process. In all, the relationships of enjoyment and hopelessness with formative achievement are tentatively assumed to be stronger in the FC compared to TC settings.

4.4.3 Empirical Evidence on Motivational and Emotional Feedback Processing in Flipped Classrooms

The following subchapter provides empirical evidence for parts of the above-elaborated theoretical rationales. First of all, findings directly related to a comparison between traditional and flipped learning will be consulted. Afterwards, the scope will be broadened to compare the efficacy of other instructional approaches that also fit in the traditional-flipped dichotomy and thus allow for similar inferences regarding the impact on AME appraisals. While design-related performance differences between TC and FC yielded a rather inconsistent picture in earlier studies, more recent meta-analyses suggest that students in the FC tend to achieve better. All of these studies also controlled for perceptions of, or satisfaction with, the course designs, which were mostly either similar for both, or in favor of the flipped format. These findings for performance and perceptions were corroborated in meta-analyses (for introductory statistics courses: Farmus et al., 2020; for health education: Hew & Lo, 2018; for mathematics: Lo & Hew, 2021; across domains: van Alten et al., 2019; see footnote 3), in literature reviews (Lo & Hew, 2017; O’Flaherty & Phillips, 2015, p. 89), and single studies (Chao et al., 2015; Chen et al., 2014; Gilboy et al., 2015; Jeong et al., 2016; Love et al., 2013; Moozeh et al., 2019; Nielsen et al., 2018; Schultz et al., 2014). Some meta-analyses found that the implementation of pre-class quizzes moderated the effect of the FC on performance in such a way that the performance benefit of the FC was contingent on the provision of weekly quizzes on grounds of the testing effect (Farmus et al., 2020; Hew & Lo, 2018). These findings allude to a synergetic effect of flipping the classroom and the provision of regular feedback (Thai et al., 2020). Occasional negative views in student perceptions were mostly related to feeling overwhelmed by the higher number of activities or still being accustomed to more traditional learning environments (Cilli-Turner, 2015, p. 840; Dehghan et al., 2022, p. 2513). The sudden change of an instructional design could also have given rise to Hawthorne effects in that the deviation from traditional practices can trigger either positive or negative effects (e.g., resistance) by the mere perception of it as a new approach (Huang et al., 2019; Lo & Hew, 2017; O’Flaherty & Phillips, 2015, p. 94; van Alten et al., 2019).

Footnote 3: Van Alten et al. (2019) found no significant design-specific differences for student satisfaction. This may be due to the heterogeneity of measurement instruments for attitudes and satisfaction and due to the high variety of included studies (n = 114), where very positive and very negative manifestations for either design might have canceled each other out.

Next, we shift the lens from the comprehensive overview to a more detailed view of comparative studies with treatment and control groups, starting with
the domain of statistics education. Statistics-related EV appraisals generally have been found to be recalcitrant to whole-class interventions (Xu & Schau, 2021, p. 316), so that the consideration of empirical findings helps to better estimate its potential efficacy. Gundlach et al. (2015) compared EV appraisals towards statistics in traditional (web-augmented), flipped, and fully online classroom designs. They found that only statistics anxiety and difficulty differed at the end of both course and were more positive in the fully traditional setting, compared to the online design. The study however has several limitations. First and foremost, students were allowed to select one of the three designs by themselves. This not only led to largely different sample sizes, but also entails a self-selection bias, whereby the individual preferences could obscure the comparison of motivational appraisals across the settings. Moreover, the features of the three courses were not varied on a sound conceptual basis, e.g., TC and FC design included quizzes, but the online part did not. A similar conceptually arbitrary potpourri of designs applies to a study from Pablo and Chance (2018). The researchers compared a simulation-based inference course in a TC and FC format and found no significant differences in pre-post attitude scores, which was likely because both designs were not sufficiently distinct. For instance, even the lecture design involved a considerable active learning component and group work (i.e., two “laptop days” a week). By contrast, the FC did not rely on the use of educational videos, but on pre-class reading assignments. Even though insignificant, flipped students still had a larger increase in average expectancies for success, a larger decrease in anxiety, and a smaller decrease in interest, and utility value compared to the traditional setting (Pablo & Chance, 2018, p. 2). 
Aberson (2000), Utts (2003), and Ward (2004) found no performance differences between students in blended or web-based versus traditional lecture courses. In Utts’ study, students evaluated the hybrid course more negatively. These studies however only had small samples and unvalidated attitude and performance measures, so that there may have been other differences that have not been captured in these studies (Zieffler et al., 2008). DeVaney (2010) compared statistics-related EV appraisals between an online and on-campus course. Similar to Gundlach et al. (2015), the study documented higher anxiety in the online course, and generally lower motivational appraisals compared to on-campus teaching (DeVaney, 2010, p. 9). Showalter (2021) compared fully online and f2f sections of an elementary statistics course that also included weekly quizzes. They found that, in the f2f sections, statistics anxiety decreased more considerably by the end of the course while online students were more interested at the end of the course (Showalter, 2021, p. 4). However, both delivery methods were only compared descriptively, and the online course
was completely asynchronous, which may well be the reason for the increased anxiety levels; i.e., when students had issues, they needed to cope with it for a longer timeframe without further assistance. Nielsen et al. (2018) found that students in a large undergraduate statistics course experienced the flipped variant significantly better than the traditional one. The comparative findings on FCs in statistics education are thus fairly inconsistent and insignificant in more frequent cases, as well as lacking comparability due to methodological shortcomings. On grounds of these inconsistencies, the scope will be broadened to research across subject domains, also including instructional practices similar to the FC. Thai et al. (2017) compared an e-learning, blended learning, flipped, and traditional classroom context, finding that the flipped format was most beneficially related to self-efficacy and intrinsic motivation. Similarly, Bouwmeester et al. (2019) found that students reported higher self-efficacy in a flipped medical classroom measured with a daily online questionnaire. Carlson and Winquist (2011) used a learning workbook approach with regular conceptual statistics exercises in preparation for the weekly class sessions. Compared to a reference dataset, the workbook approach led to more positive manifestations in statistics anxiety and self-efficacy. Herman (2021) redesigned a statistics course by the inclusion of more authentic, application-oriented learning materials and ‘laptop days’, in which students worked in groups to analyze real data. Self-efficacy, difficulty, and anxiety developed more favorably throughout the semester while value and interest decreased insignificantly. Means of the post-test motivational appraisals were also higher than in a preceding traditional course (Herman, 2021, p. 91). Bateiha et al. 
(2020) compared a teacher-oriented lecture class with a student-oriented active learning class in an introductory statistics course by means of mixed methods. They found stronger increases in self-efficacy and stronger decreases in anxiety for the students in the student-centered class (Bateiha et al., 2020, p. 162). While interest declined in both formats, the loss was more pronounced in the teacher-oriented group. For students in the non-lecture format, the difficulty appraisal was more predictive of the final exam score compared to the teacher-centered course. This finding suggests that students became better acquainted with the aspiration level of their tasks and learning material in the student-oriented approach due to repeated practice and thus gained greater control over the learning material compared to the teacher-centered group. The quantitative findings from Bateiha et al.’s (2020) study have to be treated with caution because the differences were not significant (possibly due to the small sample size of approx. 30 students per course). The insignificant results were at least substantiated by subsequent qualitative interviews.

Since no empirical studies on achievement emotions in statistics-related TC and FC could be found, these relationships will also be reviewed across domains. Two studies were found that compared flipped to traditional teaching, in which students experienced greater enjoyment in attending and processing exercises in flipped sections (Burgoyne & Eaton, 2018, p. 156; Dehghan et al., 2022). Considering similar instructional approaches, Curelaru and Diac (2022) investigated the relationship between perceived classroom assessment climates and achievement emotions in a university Language and Literature course. They found that low perceived choice and perceived performance-orientation were related to hopelessness. By contrast, perceiving assessments as learning-oriented was positively related to enjoyment, and negatively to anxiety and hopelessness (Curelaru & Diac, 2022, p. 60). The positive relationship between entity theoretical thinking and negative emotions was also corroborated in Tempelaar et al. (2012). Even though the researchers did not relate the study to specific instructional approaches, the assumption that the FC promotes choice and learning-orientation points to its potential for evoking positive and diminishing negative emotions. Tempelaar et al. (2012) analyzed a large sample of first-year Business and Economics students regarding their preferences for online learning usage via path modeling. The researchers found that favorable emotions (high enjoyment, less hopelessness) were predictive of students’ objectively measured usage of an online learning environment (Tempelaar et al., 2012, p. 167). By contrast, favorable emotions were not related to the attendance in the f2f sessions. The higher relevance of emotions in online learning contexts could stem from the higher amounts of visual stimuli and manipulables (i.e., digital practice tools, content management systems, quizzes) as postulated by flow theory. Jacob et al.
(2019) compared student teachers’ emotions in student- versus teacher-oriented instruction and found that the latter approach, counterintuitively, fostered more positive emotions and reduced anxiety more than the student-oriented setting. They conclude that this may be due to the fact that students are still accustomed to traditional teaching (Jacob et al., 2019, p. 1778). Another reason for this finding could be that two different teachers were involved, and the study did not distinguish between their differential influence as regards personality and competence. In contrast to the finding of Jacob et al. (2019), Parr et al. (2019) found that dialogic instruction positively related to enjoyment in a large sample of middle school mathematics students. Seifried and Sembill (2005) conducted a large-scale study at commercial high schools, in which chalk-and-talk classrooms were compared to autonomy-supportive classrooms. The autonomous classrooms featured project-based group work with authentic tasks monitored by tutors, who gave regular feedback (Seifried & Sembill, 2005). Students indicated their emotional experience by means of an ecologically valid continuous state sampling method simultaneously during the learning process with individual electronic devices. Students in the autonomous classroom had significantly higher feelings of control and intrinsic motivation at the same level of knowledge compared to students in the TC (Seifried & Sembill, 2005; Seifried, 2003). Similar to Jacob et al.’s (2019) finding for a TC, the variable indicating well-being was significantly higher in the traditional setting. This may be owed to the higher complexity of the autonomous classroom and the lack of direct procedural instruction, rendering the experience less comfortable (Seifried & Sembill, 2005, p. 664). The comfortableness in the TC could have been treacherous since the students may not have been fully aware of their learning process due to lack of practice (Seifried & Sembill, 2005, p. 664). Finally, the variable most proximal to achievement, effort, or student engagement, was occasionally measured as a self-report measure or by means of learning analytics, system logs, or attendance rates. Some studies indicated that students in the FC devoted more time to processing learning material or online modules compared to students in the TC (Balaban et al., 2016; Burgoyne & Eaton, 2018, p. 156; Chen et al., 2014), while Lo and Hew (2021) found no significant difference in perceived effort, despite increases in the rates of interaction and participation. In Bateiha et al.’s (2020) study, students in teacher- and student-oriented classes started with higher effort appraisals, but interestingly, the average perceived effort at the end of the active learning condition was lower compared to the teacher-centered condition. This finding contradicts other SDT studies which have shown that autonomy-supportive, activity-oriented environments lead to higher engagement and persistence on the part of the students (Baker & Goodboy, 2019, p. 82).
In sum, existing theoretical and empirical research on flipped instruction or related teaching practices is quite mixed. This likely stems from the differences in the concrete instructional setup, class activities, course content, as well as the measurement of the target variables and covariates. Meta-analyses on the effects of the FC also still had a considerable between-study heterogeneity (Farmus et al., 2020, p. 323). The additional consultation of similar instructional approaches provides a clearer picture. By tendency, self-efficacy, difficulty, interest, effort, enjoyment, and hopelessness had more favorable manifestations in the flipped or active learning condition. This conforms to the overview provided in Section 4.4.2, which suggested that student perceptions were generally rather in favor of the FC. As regards anxiety, studies in favor of traditional and active learning approaches are roughly on par. The relation between anxiety and course
design is likely more dependent on other classroom characteristics, such as learning culture, class size, teacher styles. For instance, by comparing several active learning practices, including group work, in-class quizzes, and think-pair-share, Hood et al. (2021) also found that several practices involving social interaction were more anxiogenic than individual work. The researchers linked this finding to the higher possibility of being judged unfavorably by others (Hood et al., 2021, p. 11). Based on these findings, the relationship between the AME appraisals will be assumed to be stronger and more beneficial in a FC compared to a TC approach.

4.5 Broadening the Feedback Model to Account for Individual and Contextual Differences in the Uptake of Feedback

The theoretical and empirical evidence from Sections 4.2–4.4 has shown that gender, prior knowledge, and course design are highly relevant variables in the context of statistical knowledge acquisition. Hence, these factors will be incorporated into the theoretical model to verify whether the feedback-related reciprocity of achievement motivation and emotions holds true across these contexts, and can thus be generalized, or differs between them. The process logic of Helmke’s supply-use model (2007) is thereby used to delineate the interrelations of the recursive learning trajectories, contingent on AME appraisals, as well as the newly added covariates (see Figure 4.1). The learning process originates from the instructional supply, which subsumes the course offerings along with the micro-didactic supportive features, of which the course design and the provided feedback are of particular importance for the present research. The instructional supply is assumed to influence students’ usage processes along with their individual interpretation. These usage processes entail the reciprocal CV appraisals in interrelation with formative achievement as depicted above. The individual usage is also influenced by students’ learning potential and cultural and societal characteristics (Conrad, 2020, p. 70; Lazarides & Schiefele, 2021, p. 14). Providing no further operationalized motivational or emotional constructs, Helmke’s supply-use model (2007) depicts the reception of instructional supplies as a black box (Conrad, 2020, p. 10). The integration of the EV theory and CV theory into this framework thereby accounts for all relevant factors in a meaningfully differentiated structure, with motivational appraisals, gender, and prior knowledge being originally rooted in the EV theory, and instructional characteristics, such as feedback, and emotional appraisals rooted in the CV theory. By the integration of these theoretical approaches, the model accounts for the contingency of AME appraisals on gender-, expertise-, and design-related factors in an encompassing, processual educational framework starting from the supply through to the individual outcomes (Conrad, 2020, p. 17). The relations depicted in the model will be investigated empirically from the inside to the outside, beginning with the relationship between formative feedback on the one hand and AME appraisals on the other hand. The generalizability of these relationships will then be tested by means of moderator analyses for gender, prior knowledge, and course design.

Figure 4.1 Contextualized FRAME Model according to Supply and Use. (Note. Hypotheses in brackets refer to a negatively assumed relationship)
[Diagram not reproduced. Recoverable elements: instructional supply (teaching climate; teaching quality, including overall course design and feedback); learning and usage processes (control: self-efficacy, difficulty; value: utility value, interest value; achievement emotions: enjoyment, hopelessness in-class and out-of-class; effort; formative achievement at successive occasions; feedback loops across time; hypotheses H1 to H7); learning potential (domain-related prior knowledge); cultural and societal characteristics; earnings (academic achievement, domain knowledge, skills, interdisciplinary qualifications).]

5 Empirical Basis

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-41620-1_5.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_5

5.1 Analytical Method of Autoregressive Structural Equation Modeling

The elaborated theoretical model stands out due to the reciprocal linkages between formative achievement and motivation and emotion over time. Therefore, the longitudinal analytical model of latent autoregressive structural equation modeling will be used to operationalize the relevant constructs in measurement models and set them in relation with each other in subsequent structural models. Longitudinal research aims at examining stability and variation in psychological, emotional, and behavioral traits over a certain timeframe. Linear structural equation models (SEM) are chosen for the present investigation because they can model the relationship between a larger number of variables of the postulated theoretical model while accounting for measurement error (Geiser, 2013, p. 81). In order to model situation-specific fluctuations of AME appraisals within the scope of one semester, a latent-state change model will be used (Geiser, 2013, p. 81). The purpose of such change models is to analyze changes of specific characteristics and psychological constructs over time. More specifically, latent autoregressive cross-lagged states models within the framework of path analysis will be estimated, which control for the stability of interindividual differences over time by factoring in earlier observations as predictors of future behavior plus time-specific variance (Curran & Bollen, 2006, p. 209; Geiser, 2020, p. 161). The aim of these models is to account for the unstable part of the change in interindividual differences that is not covered by the autoregressive effects by means of other preceding variables or constructs (i.e., cross-lagged effects). The models were set up and analyzed in the software Mplus Version 8.7 (Muthén et al., 2011).
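The autoregressive cross-lagged logic described in Section 5.1 can be sketched in equation form. This is a minimal illustration under assumptions, not the exact Mplus specification used in the study: the symbols $\eta_t$ (a latent AME state at measurement occasion $t$), $y_t$ (formative achievement, e.g., a quiz score, at occasion $t$), $x_{it}$ (the $i$-th indicator of the latent state), and all coefficient names are introduced here purely for illustration.

```latex
% Measurement model (assumed notation): indicator i of the latent state at occasion t
x_{it} = \nu_{i} + \lambda_{i}\,\eta_{t} + \varepsilon_{it}

% Structural model: autoregressive paths (\beta) and cross-lagged paths (\gamma)
\begin{aligned}
\eta_{t} &= \alpha_{\eta} + \beta_{\eta}\,\eta_{t-1} + \gamma_{\eta y}\,y_{t-1} + \zeta_{\eta,t} \\
y_{t}    &= \alpha_{y}    + \beta_{y}\,y_{t-1}       + \gamma_{y \eta}\,\eta_{t-1} + \zeta_{y,t}
\end{aligned}
```

In this sketch, $\beta_{\eta}$ and $\beta_{y}$ would capture the stability of interindividual differences over time, $\gamma_{\eta y}$ and $\gamma_{y\eta}$ the cross-lagged effects that explain change not covered by the autoregressive paths, and $\zeta_{\eta,t}$ and $\zeta_{y,t}$ the time-specific residual variance.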

5.2 Underlying Circumstances of the Data Collection

5.2.1 The Traditional and Flipped Course Frameworks

For a better understanding of the assumptions within the analysis model, both course designs and their underlying teaching practices will be described first. The data were collected in a more traditional introductory statistics course (summer term 2017) and a FC variant of the same course by the same professor (summer term 2018) at the faculty of law, management, and economics at a German university. Both the traditional and flipped course consisted of lectures of 135 minutes in a large lecture hall with up to 700 students and a weekly tutorial of 90 minutes in smaller groups of approximately 30 students. The f2f lectures in the flipped course took place only biweekly to compensate for the additional self-learning time (reducing the seat time from 5 to approximately 3.5 semester periods per week in the 13-week semester). In the traditional statistics course, the lecturer usually guided the students through linearly structured lecture slides while the students remained passive and took notes. Interaction in the traditional lecture was limited to the lecturer encouraging students to ask questions. Exemplary tasks were occasionally included in the lecture, which the lecturer worked through for the students. The lecturer gave further elaborations on exemplary calculations and related them to previously taught statistical concepts. The tutorials of the traditional lecture were similar in that tutors, instead of the lecturer, worked through various statistical tasks from work sheets while students wrote down the solutions. In both courses, four electronic quizzes were implemented to give students formative feedback on their learning progress. While electronic quizzes had already been used in the summer term 2017, the flipped course in 2018 placed a much stronger focus on further self-learning material and restructured the f2f lectures and tutorials to adhere to the FC approach.
First and foremost, a set of 36 educational videos with an average length of 10 to 13 minutes had been produced and specifically tailored to the curricular contents of the former lecture. The students were supposed to infer the relevant statistical content from the conceptual learning videos, which create a problem that is close to the real world, dynamically explain the necessary statistical procedures, and focus on transmitting declarative and conceptual understanding. Each pre-class video concludes by touching on limitations or on points worthy of discussion,
which the lecturer would discuss during the subsequent in-class session. The videos are supported by visual material (i.e., animated diagrams, illustrations) and additional visual and auditory cues. Apart from these videos, online statistics tools had been developed, so that students could autonomously interact with data, distributions, and graphs in real-time to figure out different ways of meaningful representation along with its mechanics and interdependencies (Garfield & Ben-Zvi, 2007, p. 388; Klinke et al., 2018). Parts of the contrived data examples were exchanged with real data (e.g., from micro-census or the Federal Office of Statistics) or with self-generated data from the participants (e.g., via Google Forms). The aim was to render the data more easily relatable to everyday life to motivate students to interact, analyze and reflect on the findings to receive a deeper understanding (Garfield & Ben-Zvi, 2007, p. 388). The key difference between the new f2f lectures and the traditional lectures is that students were expected to come prepared as the instructor did not repeat any conceptual and declarative content that had already been transmitted in the educational videos. Only if prepared, students could participate actively in the f2f lecture, pinpoint their own misconceptions during task processing, directly ask the lecturer targeted questions, and assimilate new insights into their existing knowledge structures. The lecturer consequently refrained from using a controlling language as regarding the preparatory learning materials to avoid a controlling climate not following SDT and thus potentially undermining motivation (see Section 4.4.2). With the previous knowledge of the videos, students solved statistical in-depth problem sets in small groups that required more intensive computations and application of the previously learned basic concepts. These exercises did not focus on the algorithmic procedures as they have already been learnt in the videos. 
Instead, these tasks encouraged students to adequately interpret, scrutinize, and discuss the relevant statistical concepts. Task processing followed the think-pair-share approach: students first dealt with the task individually and related it to the previously watched educational videos. In pair or group work, students then discussed their potential solution approaches and solved the task collaboratively. Meanwhile, the instructor and two tutors walked around the hall to answer questions as they arose. Results from the group work were critically discussed at the end of each session and contextualized with regard to preceding and succeeding topics of the course. In addition, audience response systems and smartphones were integrated into the f2f lecture for collaborative and competitive voting so that the lecturer and students could determine the extent to which the tasks were understood and completed correctly. The questions ranged from easier, recapitulating, and activating questions to more cleverly posed ones. The latter were designed to mislead students into answering incorrectly


to lay bare common misconceptions and certain limitations of statistical concepts (such as the spurious correlation between shoe size and affinity for statistics via gender). In contrast to the f2f lecture, the weekly tutorials were organized in smaller groups, in which students also worked on such exercises. The difference from the 2017 tutorials was that the tutor did not show any sample solutions; instead, students were invited to solve the exercises together in groups and to ask questions in a less formal context than in the f2f lecture. Figure 5.1 shows the key learning material implemented in the FC design (Förster et al., 2022).

Figure 5.1 Learning Opportunities and their Structure within the Flipped Classroom

Table 5.1 summarizes the opposing features of the 2017 traditional course with those of the 2018 flipped course.

Table 5.1 Comparison of the Characteristics of the Traditional and Flipped Course in the Context of this Study

| | Traditional course | Flipped course |
| --- | --- | --- |
| Learning material for the preparatory and out-of-class phases | No educational videos; four quizzes; lecture notes | 36 educational videos; four quizzes; lecture notes; interactive demonstrations, data examples |
| F2F lecture (approx. 700 participants) | 2.5 h / weekly; teacher-centered; questions occasionally asked (either by teacher or students) | 2.5 h / biweekly; student-centered, no spoon-feeding of required knowledge; group work and discussions on problem sets according to think-pair-share; plenary discussion |
| F2F tutorial (several courses à approx. 30 participants) | 1.5 h / weekly; tutor-centered | 1.5 h / weekly; student-centered collaborative work on tasks with opportunity to ask questions |

To broaden the data corpus for the longitudinal analyses, both courses described above are included in the present study. The next section provides a more detailed description of the measures employed to assess students’ cognitive, motivational, and emotional development while going through both courses.

5.2.2

Measurement Instruments

The designed and implemented measures sought to address the research gap of appropriately triangulated data sources, ranging from sufficiently validated motivational and cognitive self-ratings to (high-stakes) achievement-relevant course data, a gap that came up in various meta-studies investigating statistics education and FC teaching (Giannakos et al., 2014; Ramirez et al., 2012). Therefore, administrative and objective performance data were collected from standardized exams and quizzes and then combined with the rating data in a longitudinal assessment framework. Regarding students’ AME appraisals associated with learning statistics, sufficiently validated and widely used measures were implemented (Pekrun et al., 2011; Xu & Schau, 2021; see Section 5.2.3). Following the recommendation of Eccles and Wigfield (2002, p. 119), the measurement instruments are related to the context of learning statistics to account for the domain specificity of state attitudes and emotions and to lend them more predictive power (see Section 3.2.2). The English-language surveys were translated into


German by professional, psychometrics-experienced translators, educational specialists, and statistical domain experts. If necessary, items were revised to include an explicit reference to the context of learning statistics. Following the recommendation of Weiber and Mühlhaus (2014, p. 128), the translated questionnaires were pretested in a pre-post-test design one year before the implementation in four different methodological and sociological statistics courses at two German universities (N = 297). In the preliminary analyses, the scales showed good to very good reliability values. In addition, a few indicators were deleted, taking into account content-related appropriateness and Cronbach’s alpha, to ensure that each survey took at most 15 minutes to complete. The initial analyses showed that, for example, the motivational measurement instruments were able to distinguish sufficiently between different subpopulations (including male/female; students with different levels of achievement-related prior knowledge; see Förster & Maur, 2015). Goodness-of-fit values of the pretest suggested potential for improvement (Förster & Maur, 2016), which will therefore be readdressed in greater depth in Chapter 6.

Cognitive measurement instruments

Formative achievement was operationalized by means of standardized electronic quiz scores¹; four quizzes were released in each semester at bi- to triweekly intervals. Each quiz consisted of four to five standardized questions and could be accessed for one week; the score reflects the result after students finally submitted the quiz. Question formats included multiple choice (20%), concrete computational tasks (50%), matching exercises (10%), and cloze questions (20%).
The first three quizzes covered the contents of the preceding few sessions, while the fourth quiz was a more extensive mock exam covering a wider range of contents from the entire semester to a similar extent as the subsequent final exam. The questions were identical in both years of assessment (see Figure 5.2 for an exemplary question). It should be mentioned that a motivational component resonates in the quiz score since only participation, not the score itself, was relevant for exam admission (see Section 3.1.2). Given that some students processed the quizzes in under three minutes, it can be assumed that they only started the quiz to fulfil the requirement for exam admission. This timeframe would not have sufficed to read all the questions or even to take wild guesses. Since quiz achievement is embedded in a motivational framework anyhow, these

1

For more design features of the quizzes implemented in the study, refer to Section 3.1.2.


200 students were asked about their body size. The results can be found in the table below, which reports the absolute and the relative frequencies.

| Body size in cm | Absolute frequency | Relative frequency |
| --- | --- | --- |
| 140 – 160 | 30 | .150 |
| 160 – 170 | 80 | .400 |
| 170 – 175 | 50 | .250 |
| 175 – 180 | 20 | .100 |
| 180 – 200 | 20 | .100 |

Please calculate mean, modus and median.

= ________________________
= ________________________
= ________________________

Figure 5.2 Exemplary Questions from an Electronic Quiz
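To illustrate the kind of computation the exemplary question in Figure 5.2 asks for, the following is a minimal sketch in Python. The source does not state which solution path the quiz expected, so the sketch rests on common textbook assumptions for grouped data: class midpoints for the mean, linear interpolation within the median class, and the class with the highest frequency density as the modal class (the classes have unequal widths). All function names are hypothetical.

```python
# Grouped data from the exemplary quiz question (Figure 5.2).
classes = [(140, 160), (160, 170), (170, 175), (175, 180), (180, 200)]
abs_freq = [30, 80, 50, 20, 20]


def grouped_mean(classes, freq):
    """Approximate the mean using class midpoints (assumption)."""
    n = sum(freq)
    return sum((lo + hi) / 2 * f for (lo, hi), f in zip(classes, freq)) / n


def grouped_median(classes, freq):
    """Interpolate linearly within the class containing the n/2-th case."""
    n = sum(freq)
    cumulative = 0
    for (lo, hi), f in zip(classes, freq):
        if f > 0 and cumulative + f >= n / 2:
            return lo + (n / 2 - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("median class not found")


def modal_class(classes, freq):
    """Return the class with the highest frequency density (frequency / width)."""
    densities = [f / (hi - lo) for (lo, hi), f in zip(classes, freq)]
    return classes[densities.index(max(densities))]


print(grouped_mean(classes, abs_freq))    # 168.375
print(grouped_median(classes, abs_freq))  # 168.75
print(modal_class(classes, abs_freq))     # (170, 175)
```

Under these assumptions, the modal class (170 to 175 cm) differs from the class with the largest absolute frequency (160 to 170 cm) because the latter is twice as wide.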

cases will be retained in the analyses². Refusal to process the quizzes conscientiously is most likely due to the lack of test consequences, motivation, or attainment value. Even though the quiz scores thus do not allow fully valid inferences about students’ true statistical reasoning capabilities, insufficient participation can be accounted for by the motivational constructs included in the assessment framework (Wise & DeMars, 2005, p. 2; Eklöf, 2010, p. 350).

² A sensitivity analysis has shown that, for instance, when removing participants that only achieved a quiz score > 5 %, the unstandardized coefficients of the motivational constructs on cognitive attainment become higher, which may be precisely because those students with less motivation to process the quiz were excluded from the analyses.

Academic achievement was operationalized by means of the final exam. The exam was administered electronically, and the standardized exam questions were comparable to those of the electronic quizzes. Exam scores in percentage terms were used instead of exam grades because the latter are sometimes curved and would add construct-irrelevant variance to the course outcome because of arbitrarily set cut-off scores, which Downing and Haladyna refer to as “indefensible passing scores” (i.e., the known 50% passing hurdle; 2004, p. 329). The exam scores, in contrast, represent the full variance of the achievement at the end of the semester. Missing responses within a processed quiz or exam were coded as wrong answers


(“zero replacement”; Xiao & Bulut, 2020, p. 933), assuming that students would have answered the respective question to earn a score if they had actually known the answer.

Measurement instruments for achievement motivation

There are quite a few measurement instruments dealing with attitudes towards statistics. Most of them, however, only cover single attitudinal facets, such as statistics anxiety (Cruise et al., 1985; Hanna et al., 2008), value appraisals of and confidence with statistical technology (Davis, 1989), and self-efficacy, self-control, affect, and persistence (Budé et al., 2007; Finney & Schraw, 2003). Other surveys that include a broader spectrum of indicators only measure statistical attitudes as a unidimensional construct (Roberts & Bilderback, 1980; Wise, 1985) and would not allow for a differentiated investigation of students’ attitudinal development. Hence, EV appraisals in statistics have been assessed in rather unsystematic and indistinct ways with these surveys. Given these shortcomings, the SATS-36 (Survey of Attitudes Towards Statistics with 36 items; Schau, 2003) was selected as it refers to an overarching, inherently coherent, and comprehensive model based on the EV model to measure achievement motivation in statistical contexts (cf. Schau et al., 2012; Tempelaar, van der Loeff, et al., 2007, p. 80; Wigfield & Eccles, 2002). This instrument has been widely validated with large samples compared to the other ones mentioned above (Nolan et al., 2012; see Section 5.3) and consists of 36 items with a seven-point Likert scaling assigned to six factors. Based on the above-mentioned pretest, four items were omitted because they showed low internal consistency and were considered less relevant to the German higher education context. The constructs are aligned to the EV model in such a way that they relate to expectancies of success (Cognitive Competence and Difficulty) and subjective task value (Affect, Interest, Value, Effort; Schau, 2012, p. 62; Tempelaar, van der Loeff, et al., 2007, p. 80; Wigfield & Eccles, 2002, p. 94)³.

³ The concrete assignment of these constructs to the EV dimensions was delineated in Section 3.2.3.

These constructs are assumed to interrelate with formative and summative academic achievement. Cognitive Competence entails the self-appraisals of one’s own competencies and skills in statistics and is conceptually related to the self-efficacy facet of the EV model (Eccles & Wigfield, 2002). Difficulty reflects the personally perceived generic difficulty of statistics as a subject. Affect refers to the extent to which students appreciate or are afraid of statistics. Value represents the appraised life- and work-related relevance of statistical tasks. Interest and Effort refer to the


students’ intrinsic motivation as well as to appraisals of the time invested and the effort needed when coping with statistical subject matter.

Measurement instruments for achievement emotions

Both enjoyment and hopelessness were assessed in in-class and out-of-class contexts, i.e., while attending statistics classes and while performing learning activities outside of class, using a 7-point Likert scale. These constructs were taken from Pekrun’s Achievement Emotion Questionnaire (AEQ; 2006). To emphasize the state-like character of the achievement emotions (see Section 3.3.2), the items referred concretely to emotions experienced during the current statistics lecture or learning activity. Out-of-class emotions were assessed immediately after students processed the first and the third electronic quiz. Enjoyment assesses students’ pleasure when learning statistics, while hopelessness assesses the appraised degree of non-attainability of success or understanding depending on the course or learning context (Pekrun et al., 2007). All indicators were particularized to the context of studying concrete statistics subject matter to evoke state-like self-appraisals (Putwain, Larkin, et al., 2013). Table 5.2 and Table 5.3 summarize all variables and constructs along with exemplary items, scaling, and reliability coefficients, which suggest good to very good internal consistency⁴.

⁴ The full list of items along with their abbreviations can be found in Appendix 2 in the electronic supplementary material.

Table 5.2 Variable Definitions, Scaling, and Cronbach’s α Reliabilities

| Variable name | Abbrv. | Definition / example items | Scaling / α |
| --- | --- | --- | --- |
| 0 COHORT | | | |
| Course design | | indicates whether student attended the traditional or flipped course | 1 = flipped design |
| 1 ACADEMIC ACHIEVEMENT | | | |
| 1.1 Summative achievement (administrative data) | | | |
| Final exam score | E | score on the final exam | score from 0 to 100; α for four exams: α1 = .756, α2 = .701, α3 = .780, α4 = .709 |
| 1.2 Formative achievement (administrative data) | | | |
| Quiz scores | Q | score on the formative electronic quizzes | score from 0 to 100; α for four quizzes: α1 = .773, α2 = .749, α3 = .846, α4 = .894 |
| 2 SOCIODEMOGRAPHIC CHARACTERISTICS (dummy variables) | | | |
| Female | | | 1 = female |
| Math experience⁵ | | | 1 = lower experience |

Table 5.3 Construct Definitions, Scaling, and Cronbach’s α Reliabilities

| Construct name | Abbrv. | # of items⁶ | Definition / example items⁷ | Scaling / α for all occasions |
| --- | --- | --- | --- | --- |
| 3 AME APPRAISALS TOWARDS STATISTICS⁸ | | | | Likert scale from 1 = strongly disagree to 7 = strongly agree |
| 3.1 SATS-36: Attitudes towards statistics / achievement motivation | | | | |
| Self-efficacy⁹ | S | 5 | “I will understand statistics concepts and equations.” | α1 = .791, α2 = .795, α3 = .836 |
| Difficulty | D | 6 | “Statistics is a complicated subject.” | α1 = .684, α2 = .707, α3 = .751 |
| Affect | A | 6 | “I will like statistics.” | α1 = .810, α2 = .788, α3 = .794 |
| Value | V | 8 | “Statistical skills will make me more employable.” | α1 = .823, α2 = .823, α3 = .831 |
| Interest | I | 4 | “I am interested in using statistics.” | α1 = .848, α2 = .831, α3 = .856 |
| Effort | E | 3 | “I plan to complete all of my statistics assignments.” | α1 = .838, α2 = .693, α3 = .755 |
| 3.2 AEQ: Achievement emotions | | | | |
| Course enjoyment | Jc | 7 | “I am looking forward to learning a lot in this statistics class.” | α1 = .911, α2 = .920, α3 = .919, α4 = .910 |
| Course hopelessness | Hc | 5 | “I have lost all hope in understanding this statistics class.” | α1 = .860, α2 = .877, α3 = .895, α4 = .889 |
| Learning enjoyment | JL | 8 | “I study more than required for this statistics course because I enjoy it so much.” | α1 = .882, α2 = .905 |
| Learning hopelessness | HL | 5 | “I feel so hopeless that I can’t give my studies for the statistics course my full efforts.” | α1 = .873, α2 = .888 |

Note. α = Cronbach’s alpha¹⁰.

⁵ For the later multiple group analyses, a dummy variable based on the median of final math grade was generated. Students with a grade of A and B were considered as students with high experience, while students with a grade of C were considered to be low-experienced (for more details, see Section 9.1).

⁶ Number of items refers to the initial original solution. In the later analyses, modified solutions with a reduced set of items will be considered as well.

⁷ The example items also refer to the direction towards which the construct is consistently aligned, i.e., “Statistics is a complicated subject.” indicates that higher values of the latent variable refer to a higher appraised difficulty.

⁸ Since both the SATS and AEQ use 7-point Likert scales, they will be treated as continuous (Babakus et al., 1987).

⁹ This construct is named Cognitive Competence in the original measure.

¹⁰ In the later confirmatory analyses, composite reliabilities of the original and of modified factor structures will be considered as well.
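The α values in Tables 5.2 and 5.3 are Cronbach’s alphas. As a minimal sketch of how such a coefficient is obtained from item-level responses, the following Python function applies the standard formula α = k / (k − 1) · (1 − Σ item variances / variance of the sum score). The function name and the response matrix are hypothetical illustration material, not study data, and the study's own software and settings are not documented in this section.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    items = list(zip(*responses))  # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical 7-point Likert answers of five respondents to three items.
demo = [
    [5, 6, 5],
    [3, 3, 4],
    [6, 7, 6],
    [2, 3, 2],
    [4, 4, 5],
]
print(round(cronbach_alpha(demo), 3))  # 0.96
```

The sketch uses sample variances (denominator n − 1) for both the items and the sum score; because the same denominator appears in numerator and denominator of the ratio, population variances would give the same α.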

5.2.3

Longitudinal Assessment Framework and Assessment Methods

The variables and constructs were transferred into a longitudinal assessment framework to represent motivational and emotional trajectories in and outside the course with the highest possible granularity and coverage throughout the semester. The surveys were coordinated with the statistics professor and embedded in the regular session time of the course and in the electronic quizzes to reduce the additional effort for students and to lower the hurdles for participation. As an incentive for students to participate and fill in the questionnaires completely, one to two iPads were raffled off among the participants after each survey. In both years of assessment, the same experienced staff and student assistants were involved to ensure the highest level of test security during the surveys. Figure 5.3 shows the distribution of surveys throughout the semester for both cohorts.

Figure 5.3 Longitudinal Assessment Framework. (Notes. # refers to number of the respective measurement occasion; week = week within the summer term 2017/2018 (out of 14); PP = paper-pencil in-course; OS = online survey outside the course; Q = quiz; see Table 5.3 for the abbreviations of the latent constructs (circles); the number following the construct initials refers to the number of the respective measurement occasion. Circle = latent construct(s); rectangle = manifest item(s): green = entry criteria; red = cognitive attainment variables; blue = EV constructs; yellow = emotion constructs; solid line = assessed in-course; dotted line = assessed online/outside the course)

At the very beginning of the course, students’ heterogeneity criteria (gender, migration background, and prior knowledge) and initial EV appraisals were surveyed. The first survey was conducted before the first lecture started to ensure that the first appraisals were not confounded with content-related input. Therefore, the items of the first questionnaire were formulated in the future tense (e.g., “I will like statistics”), while the items of all subsequent surveys used the present tense. At the beginning of the flipped semester, students were not directly informed about the new teaching style to avoid bias in the first assessment. All subsequent in-course surveys were conducted at the end of each course session because, as Gómez et al. (2020, p. 236) put it, AME appraisals need to be elicited beforehand to support reliable and valid measurements. Learning-related emotions were assessed outside the course after the electronic quizzes¹¹ to capture students’ emotional states while learning outside the class (see Section 5.2.2)¹². Course- and learning-related emotions were hence assessed in different surveys and learning contexts to avoid inflated resemblances in emotions across these contexts (Ranelluci et al., 2021, p. 11). Assessing both course- and study-related states in the respective contexts was intended to support external validity and to ensure that the targeted contexts of the emotional states are adequately reflected. Participating in the out-of-class surveys took approximately 15 minutes of free time and was rewarded with a ticket for the raffle of an iPad. In the measurement framework, students were asked alternately about their achievement motivation (#1, 5, and 9) and achievement emotions (#2, 3, 4, 6, 7, and 9) to model potentially reciprocal effects according to the CV theory. On the one hand, attitudes towards statistics (EV appraisals), assumed to be more stable, were recorded at larger intervals during the semester (beginning, midterm, end of semester). On the other hand, statistics-related achievement emotions were assumed to be more volatile than EV appraisals and were thus assessed five times at equidistant, triweekly intervals. Table 5.4 presents an overview of the abbreviations and nomenclature used to refer to constructs and their interrelations within the longitudinal empirical framework.

¹¹ The timing and distribution of the four quizzes across the semester varied between the two courses (see Figure 5.3), which will be taken into account when modeling the later structural relations.

¹² Originally, the AEQ constructs were designed to assess emotions before, during, and after attending the course and studying outside the course. The time reference to which students should project their thoughts in retrospect comes from the assessment instructions. This study, however, focused on the emotional states during the specific situations, with close proximity between achievement situation and assessment, to avoid confounded, retrospective projections.


Table 5.4 Abbreviations Used to Refer to the Constructs throughout their Longitudinal Assessment

| Example | Reference / Implication |
| --- | --- |
| t5 | t refers to the number, or rather time, of the measurement according to Figure 5.3, i.e., measurement occasion 5 at week 8. |
| S5 | An upper-case letter followed by one digit refers to the construct self-efficacy only at t5 (week 8). |
| s1 | A lower-case letter followed by one digit refers to the specific item of the construct self-efficacy at all measurement occasions (i.e., item #1 of self-efficacy at t1, t5, and t9). |
| s51 | A lower-case letter followed by two digits refers to a specific measurement occasion (first digit) and a specific item of the construct (second digit, i.e., considering t5 and item #1 of self-efficacy). |
| X→Y | Unstandardized path coefficient from x to y (y regressed on x). |
| X↔Y | Correlation between x and y (x with y). |

| Significance | Depiction of unstandardized coefficient | In-text references |
| --- | --- | --- |
| p < .01 | | Superscript c (e.g., .851ᶜ) |
| p < .05 | | Superscript b (e.g., .540ᵇ) |
| p < .10 | | Superscript a (e.g., .301ᵃ) |
| p > .10 | no graphical path | No superscript (e.g., .141) |
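The naming scheme of Table 5.4 can be read as a small decoding rule. The following Python sketch is an illustration under stated assumptions, not part of the study's tooling: it covers only the single-letter SATS abbreviations from Table 5.3 (upper-case E is read as Effort here, although Table 5.2 also uses E for the exam score), it leaves out the two-letter AEQ abbreviations, and it treats p = .10 like p > .10, a case the table does not specify. All names are hypothetical.

```python
import re

# Single-letter SATS abbreviations from Table 5.3 (assumption: AEQ labels
# such as Jc or HL are not handled in this sketch).
CONSTRUCTS = {"s": "Self-efficacy", "d": "Difficulty", "a": "Affect",
              "v": "Value", "i": "Interest", "e": "Effort"}


def decode_label(label):
    """Decode labels such as 't5', 'S5', 's1', or 's51' following Table 5.4."""
    match = re.fullmatch(r"([A-Za-z])(\d)(\d)?", label)
    if match is None:
        raise ValueError(f"unsupported label: {label!r}")
    letter, first, second = match.groups()
    if letter == "t" and second is None:
        return {"occasion": int(first)}
    construct = CONSTRUCTS.get(letter.lower())
    if construct is None:
        raise ValueError(f"unknown construct letter: {letter!r}")
    if letter.isupper():
        if second is not None:
            raise ValueError(f"unsupported label: {label!r}")
        # Upper case + one digit: whole construct at one occasion.
        return {"construct": construct, "occasion": int(first)}
    if second is None:
        # Lower case + one digit: one item across all occasions.
        return {"construct": construct, "item": int(first)}
    # Lower case + two digits: occasion first, item second.
    return {"construct": construct, "occasion": int(first), "item": int(second)}


def significance_superscript(p):
    """Map a p-value to the in-text superscript of Table 5.4."""
    if p < 0.01:
        return "c"
    if p < 0.05:
        return "b"
    if p < 0.10:
        return "a"
    return ""  # assumption: p = .10 is treated like p > .10


print(decode_label("s51"))
print(significance_superscript(0.03))
```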

5.3

Quality Criteria of the Study Design and Measurement Instruments

5.3.1

Objectivity Evidence

Now that all measures have been delineated in terms of their terminology and implementation in the assessment framework, the underlying quality criteria along with the implications for subsequent data collection and analyses have to be addressed. For a preliminary evaluation of the quality of the study design outlined above, the scientific quality criteria of objectivity, reliability, and validity of the inferences stemming from the results of the measurement instruments will be considered. In the following subchapters, each quality criterion will be briefly described and


evaluated according to existing theoretical and empirical evidence for the outcome (quiz and exam¹³), motivational (SATS), and emotional measures (AEQ) in the order specified. Objectivity is given when measurements and their interpretations are independent of the researcher. This implies that different researchers cannot come to different conclusions when assessing students’ answers (Moosbrugger & Kelava, 2020, p. 18). Evaluation objectivity can be assumed due to the standardized questions that assign numerical values to each participant independently of the researcher (Lüdders & Zeeb, 2020, p. 121)¹⁴. The fixed range of values of the standardized questions also contributes to a consistent interpretation (Bühner, 2021, p. 569; Moosbrugger & Kelava, 2020, p. 19). The electronic quiz and exam questions were standardized multiple choice, cloze, or matching exercises. The automatic scoring of the answers, apart from plausibility checks, without room for individually differing interpretations is a necessary requirement to use grades as measures of achievement (Cohen, 1981; Downing & Haladyna, 2004, p. 328). Due to the predetermined order of items and the sequence of the questionnaire, the objectivity of test execution was not subject to considerable variation (Lüdders & Zeeb, 2020, p. 121; Moosbrugger & Kelava, 2020, p. 18). The same test director adhered to the same procedure of test execution at all measurement occasions. Concise and easy-to-read instructions on the first page of each questionnaire briefly explained the study context and how the test should be filled out to ensure that the scanned data could be easily imported into the scanning software. Moreover, specific instructions were given for the rating of each measurement instrument so that no interaction between test director and participants needed to occur during testing (Lüdders & Zeeb, 2020, p. 121). Each paper-pencil assessment across the semester (i.e., except for the online surveys) was conducted with the same testing materials, on the same weekday and at the same time of day during the course, and in the same lecture hall so that external conditions should have had a negligible impact on testing behavior.

¹³ In this chapter, quiz and exam scores will only be investigated in terms of their objectivity and test content. Evidence based on the internal structure will be omitted and relations to other variables will be readdressed after the analysis of the measurement and structural models.

¹⁴ The further accuracy of data processing and preparation will be readdressed in the scope of validity based on the response processes (Downing, 2003, p. 832; see Section 5.3.5).

5.3.2

Reliability and Validity Evidence Based on the Internal Structure

Objectivity on the level of data collection, input, and interpretation is a necessary requirement for the comparability and generalizability of the test scores, following the understanding of the Standards for Educational and Psychological Testing (AERA et al., 2014)¹⁵. According to this understanding, validity is the extent to which data and theory support the interpretation of test scores so that interpretations conform to the proposed test use. Measurement validity requires the absence of both random and systematic measurement errors and determines the validity of assumptions derived from test scores (Weiber & Mühlhaus, 2014, p. 159). Factorial validity refers to evidence that each measurement model assesses the underlying construct that it is supposed to measure and that similar or different constructs correlate with each other as theory would predict (AERA et al., 2014; Kibble, 2017). This validity criterion also entails the reliability of a measurement, implying that the test scores are free of random error across repeated measurements (Downing, 2003; Rios & Wells, 2014). Therefore, reliability will be treated as part of this validity aspect.

¹⁵ The traditional understanding of validity is that the construct measures what it claims to measure (Bühner, 2021, p. 601).

Internal consistency of the SATS-36

The internal consistency of all constructs of the SATS was documented to be good to excellent in the summative review by Nolan et al. (2012). According to this review, the Cronbach’s alpha reliabilities of the six constructs range from .78 to .91 except for Difficulty, which performs worst with most values ranging between .51 and .75 (Bechrakis et al., 2011; Coetzee & van der Merwe, 2010; Emmioglu et al., 2018; Stanisavljevic et al., 2014; Tempelaar, van der Loeff, et al., 2011; Xu & Schau, 2021). The six-factor structure of the SATS-36 has been confirmed in numerous construct validation studies (largest and most recent validation study by Xu & Schau, 2020; for others see Persson et al., 2019; Shahirah & Moi, 2019; Stanisavljevic et al., 2014; Tempelaar, van der Loeff, et al., 2007) and has to date been adapted to many different languages, countries, and statistical domains (Emmioglu et al., 2018). The Cronbach’s alphas for the present study were also appropriate, whereby the achievement emotion constructs had a higher reliability than the attitudes towards statistics (see Table 5.3). In pre-posttest designs using SEM, the pre-test measures were close to the goodness-of-fit cutoff values, while the post-test yielded better overall fit and thus accounted


better for item variability (Vanhoof et al., 2011; Xu & Schau, 2021). However, in many studies, the desired cutoffs have only been approached by means of further modifications to the original factor structure and indicators (e.g., Chiesi & Primi, 2009; Hommik & Luik, 2017; Vanhoof et al., 2011; Xu & Schau, 2021). Between-factor correlations are moderate to high across all studies (Nolan et al., 2012). Sometimes, models with a reduced number of factors were compared to the original six-factor model, for example by combining the highly correlated factors Affect and Cognitive Competence, which have nearly perfect correlations in some studies (Emmioglu et al., 2018; Hommik & Luik, 2017; Lavidas et al., 2020; Persson et al., 2019; Xu & Schau, 2021). In most cases, both the original six-factor solution and the reduced-factor solution yielded adequate fit indices. As Chiesi and Primi (2009) put it, the decision for a six-dimensional model or a model with fewer factors should be driven by the empirical and theoretical interests with which the researcher intends to investigate distinct statistical attitudes. Particularly in many earlier studies, item parcels containing two variables were generated in conjunction with confirmatory analyses as recommended by the scale developers (Chiesi & Primi, 2009; Pritikin, 2018, p. 491; Stanisavljevic et al., 2014; Tempelaar, van der Loeff, et al., 2007, p. 85), while only a minority of studies considered using individual items (Vanhoof et al., 2011; Xu & Schau, 2019). Parceling is often argued to have the advantage of yielding approximately continuous data to compensate for high skewness and of achieving a better balance between the number of variables and the sample size (So, 2010, p. 151; Tempelaar, van der Loeff, et al., 2007). Even though Schau et al. (1995) base item parceling on criteria such as skewness and standard deviation, it remains a means to an end for increasing the reliability of the scales (Persson et al., 2019).
In the context of this study, the prevalent approach of item parcels is seen as a limitation to the factorial validation of the SATS-36 factor structure because the scores of each answer are averaged per parcel and do not provide information on problematic items (Xu & Schau, 2019). Parceling obscures potential sample variance and underlying multidimensional structures while making the modelled data deviate from the observed data (Persson et al., 2019), as Xu and Schau admit in a recent paper (2021). Hence, path coefficients may be inflated by effect tendencies that are dissociated from a realistic estimation of effect weights (So, 2010, p. 151). Specifically, the recommendation of having three parcels per factor cannot be fulfilled for those factors of the SATS-36 containing a small number of items (Vanhoof et al., 2011). In particular, Tempelaar, van der Loeff, et al. (2007) had to create parcels with only one indicator for Interest and Effort to stick to this rule. Persson et al. (2019) conclude that most of the early studies using the SATS-36 used parceling,

144

5

Empirical Basis

which resulted in a lack of information about the adequate functioning of individual items of each construct¹⁶. The above-mentioned advantages are moreover less relevant for this study due to the large sample size, its relation to the number of used variables, and the relatively low skews of each variable (see Section 5.4.3). Due to the 7-point Likert scales, the variables can be treated as continuous (Babakus, Ferguson, & Jöreskog, 1987; Mesly, 2015) without item parceling. This study refrains from item parceling by scrutinizing items potentially in need of further evaluation and adopting a more empiricist stance.

Internal consistency of the AEQ

In contrast to the SATS-36, the AEQ scales themselves do not reference an overarching model in which each construct functions as an integral part. Rather, the AEQ offers 8 different scales from three different contexts from which researchers usually select the ones of interest for their own research purposes (Gómez et al., 2020). Cronbach's alpha for the enjoyment and hopelessness scales ranges from sufficient to excellent values of .78 to .94 (Bhansali & Sharma, 2019; de la Fuente et al., 2020; Fierro-Suero et al., 2020; Peixoto et al., 2015; Starkey-Perret et al., 2018). Pearson product-moment correlations between the selected emotional states of enjoyment and hopelessness ranged between –.30 and –.40 across several studies. This suggests a sufficient degree of divergent validity while also emphasizing that they have an opposite valence (Pekrun et al., 2011; Bhansali & Sharma, 2019; Lichtenfeld et al., 2012). To further examine the internal structure of the emotion constructs, Pekrun et al. (2004) and Pekrun et al. (2011) tested a general factor model against a hierarchical model including the theoretically assumed four-component structure for each emotion (motivational, cognitive, affective, physiological).
Such a validation check will not be conducted in the present study since some of these component structures would only consist of one or two items depending on the emotion (e.g., the physiological component of class-related enjoyment), which is problematic for latent modeling (Geiser, 2010). Apart from hierarchical models based on the four different components, other validation studies tested models depending on the assessment contexts. Few studies tested all emotion scales across all contexts starting with a bipolar, general factor model entailing all positive and negative emotions, stemming from the idea of non-differentiation between discrete emotions (Lichtenfeld et al., 2012; Starkey-Perret et al., 2018). The next differentiation assumes a hierarchical model considering the achievement context (class, learning, and test), in which either three second-order contextual factors subsume the different emotional facets, or a second-order emotion subsumes the different achievement contexts (Pekrun et al., 2011; Lichtenfeld et al., 2012). In these studies, the hierarchical models outperformed the general factor model in terms of model fit, suggesting that modelling the discrete emotions under consideration of the three achievement contexts best explains the emotional relationships. Follow-up studies for other populations corroborated these findings (de la Fuente et al., 2020; Starkey-Perret et al., 2018). Hierarchical modeling of context-specificity within one second-order emotion is not possible in this study because both contexts were assessed at different points of time to ensure that students answer the questions in an ecologically valid environment (either in-course or at home). The separated model approach will therefore be pursued to ensure that feedback effects can be differentiated according to the achievement contexts (i.e., class- and learning-related). Several other studies compared measurement models of the emotional constructs either for only one achievement context (e.g., de la Fuente et al., 2020; Fierro-Suero et al., 2020; Bhansali & Sharma, 2019) or separately for different achievement contexts (e.g., Davari et al., 2020; Peixoto et al., 2015). The results of these studies suggest that fit indices are still appropriate near the upper-bound recommended values of CFI, TLI, RMSEA, and SRMR without a further hierarchized structure. This suggests that inferences made from the achievement emotion models can be considered reliable and valid irrespective of being organized as hierarchical achievement context models or separate class-, learning- or test-related models.

¹⁶ For instance, most studies found Difficulty to be the factor with the lowest reliability but could provide no information about which items could have been problematic due to the parceling approach. More recent studies using the SATS-36 also focus on individual item functioning (e.g., Hommik & Luik, 2017; Persson et al., 2019; Shahirah & Moi, 2019). These studies will be reconsidered in Chapter 6 to draw parallels between the (mal-)functioning of certain indicators.
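
The internal-consistency and parceling arguments of this subchapter can be made concrete with a small numerical sketch. The following Python snippet is purely illustrative and is not the analysis code of this study: the function names `cronbach_alpha` and `build_parcels` are hypothetical, and the response matrix is invented toy data on a 7-point scale. It computes Cronbach's alpha from item-level responses and then averages pairs of items into parcels, which illustrates why parcel scores no longer reveal which of the two underlying items behaves differently.

```python
import numpy as np


def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # sample variance of the sum score
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)


def build_parcels(items, parcel_map):
    """Average the items of each parcel; parcel_map holds lists of column indices."""
    items = np.asarray(items, dtype=float)
    return np.column_stack([items[:, cols].mean(axis=1) for cols in parcel_map])


# Invented toy data (assumption): 6 respondents, 4 items of one factor, 7-point scale.
responses = np.array([
    [6, 7, 6, 2],
    [5, 5, 6, 6],
    [3, 4, 3, 5],
    [2, 2, 1, 4],
    [7, 6, 7, 3],
    [4, 4, 5, 1],
])

alpha_items = cronbach_alpha(responses)
parcels = build_parcels(responses, [[0, 1], [2, 3]])  # two-item parcels

print(f"alpha (individual items): {alpha_items:.2f}")
print("parcel scores (item-level information is averaged away):")
print(parcels)
```

In this toy matrix the fourth item deviates from the other three, but after parceling it is merged with the third item, so an item-level inspection would be required to detect it.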

5.3.3 Validity Evidence Based on Test Content

Evidence based on test content refers to a sufficiently precise, semantic, and holistic representation of the achievement domain in which the constructs are measured as regards content, requiring a well-grounded conceptualization and evaluation by experts (Sireci & Faulkner-Bond, 2014). Cognitive outcome in the present study is operationalized by means of the exam and quiz score since existing scales for the measurement of statistical reasoning are subject to certain limitations (see Section 2.1.4 and 5.3.3). Another advantage of using course-specific measures regarding content and curricular validity is that they represent the achievement in terms of the contents learned in the course (Pekrun et al., 2014).
This may, however, affect the international and cross-national comparability¹⁷, so that additional information has to be given to infer the degree of (inter-)national representativeness of these scores¹⁸.

Content evidence of the quiz and exam scores

To provide a vivid picture of the measurement instruments and to attenuate inferences based on mere face validity (Downing & Haladyna, 2004, p. 332), the items from the measurement instrument will be assigned to and compared with the content areas from the predefined construct (Davidshofer & Murphy, 2013, p. 156; see Section 2.1.3). More concretely, following the recommendations of Kibble (2017, p. 113), Table 5.5 presents a test outline with the share of exam questions covering each respective topic under consideration of the ten most renowned, fundamental statistics books for sociologists and economists in Germany to illustrate to what extent the exam relates to the relevant statistical core topics¹⁹. Most of the topics dealt with in the quizzes and exams are also represented by the majority of the renowned statistics books, except for time-series analyses as well as price and value indices, which are more specific topics in economic sciences. In all, the topics taught in the statistics course can be assumed to approximately represent the nation-wide fundamentals of statistics. Apart from the coverage in the renowned books, each topic was also assessed in at least one quiz and both final exams, thwarting the threat of construct under-representation of the content domain (Downing & Haladyna, 2004, p. 329). The two more uncommon topics (time series and price and value indices) each occurred in only one of the two exams. Moreover, the exam score represents a broader variety of relevant statistical topics than most of the measurement instruments presented in Section 2.1.2.
¹⁷ Cross-national comparability may be affected because higher education modules and curricula are decentralized in German universities without a common nation-wide standard.
¹⁸ As the focus of the present study is on the quantitative analysis, qualitative information on the cognitive attainment variables will only be depicted rudimentarily.
¹⁹ References to the consulted statistics books are included in Appendix 1 in the electronic supplementary material.

Table 5.5 Comparison of the Content Assessed in the Quizzes and Exam Compared to its Coverage in the Course and an Exemplary Core Curriculum

| Topic | Treated in # books (out of 10) | Assessed in the quizzes (Q1–Q4) | Coverage in the exams |
|---|---|---|---|
| Statistical features and variables: sample, population, variables, scales, distribution, frequency and distribution function, histogram | 6 | X | 15% |
| Statistical measures: position measure, dispersion measure, quantiles, concentration measures | 10 | X | 15% |
| Two-dimensional distributions: scatterplot, marginal distribution, statistical independency, covariance and correlation, contingency | 7 | X | 20% |
| Linear regression: single and multiple, R², non-linear relationships | 10 | X | 15% |
| Time-series analyses: components, trends, smoothing, seasonality | 4 | X | 10% (2017) |
| Price and value indices | 6 | X | 10% (2018) |
| Combinatorics: factorials, binomial coefficient, fundamental principle, permutations, combinations; Probability theory: random experiments, fundamentals, conditional probabilities, stochastically independent events, Bayes' theorem | 8 | X | 25% |
| Probability functions | 7 | – | not covered in the “Statistics I” course |
| Point and interval estimation | 8 | – | not covered in the “Statistics I” course |
| Random variables | 6 | – | not covered in the “Statistics I” course |
| Hypotheses tests | 8 | – | not covered in the “Statistics I” course |
| χ² test, t-test, F-test | 8, 4, 3 | – | not covered in the “Statistics I” course |
| Analysis of variance | 3 | – | not covered in the “Statistics I” course |

Notes. Q = Quiz; X = topic assessed in at least one of the four quizzes; the share of the coverage in the exam refers to the number of exam points allotted to the respective topic, divided by the overall score.

Hence, even though no formal evidence was gathered on the validity of inferences based on the exam scores, the cognitive attainment variables represent the topics taught across the semester sufficiently well and each topic was dealt with in approximately equal shares across all performance assessments. Downing & Wise (2004, p. 332) also argue for an alternative conceptualization of face validity, where such a conformity between assessment and learning situation increases its acceptance and perceived utility. Using these scores within the assessment also has the advantage that students do not have to answer additional questionnaires on statistical reasoning, thus ensuring a sufficiently large sample regarding the achievement scores. Moreover, over two decades, the item pool for quiz and exam tasks has been thoroughly crafted, tested, and revised according to their solution frequencies, sufficient discrimination of performance, and accuracy to minimize the occurrence of biased or flawed items that would lead to construct-irrelevant variance (Downing & Haladyna, 2004, p. 327). In terms of content coverage, the curricular comparison in Section 2.1.3 has shown that the course contents along with the quiz and exam measures mostly cover basic statistical analyses, such as descriptive statistics, correlation, regression, and probability theory, while more advanced topics, such as hypotheses tests, mean comparisons, and probability distributions, are dealt with in the further course of studies. In terms of task difficulty and under consideration of Bloom’s taxonomy (Anderson & Krathwohl, 2001, p. 27), the questions mostly cover the levels of applying, analyzing, and, to a lesser degree, understanding, which corresponds to the instructional objectives. The highest level of evaluation is not covered due to the standardized item pool, which can be assumed to be a common practice for large assessments in introductory statistics courses. Aside from the curricular validity, it should be discussed to what extent quiz and exam scores reflect true proficiency. Thereby, the varying stakes of quizzes and exam need to be considered. On the one hand, the high-stake exam scores, counting towards the GPA and assessed under proctored and controlled conditions, are an appropriate measure of achievement to make ecologically valid inferences related to students’ real-world study context (Downing & Haladyna, 2004, p. 329).
These scores can be construed as the true outcome of students’ motivational and emotional development throughout the semester, in contrast to alternative knowledge tests that only cover certain statistical aspects (Pekrun et al., 2014). On the other hand, the quiz scores are low-stake scores; merely starting each
of the four quizzes was a requirement to pass the course regardless of the actual quiz performance. Particularly for performance tests with no severe consequences, achievement is “a joint function of skill and will”, as Eklöf (2010) aptly puts it. This alludes to the positive impact of test motivation on effort, persistence, and performance²⁰, whereby the absence of test-taking motivation might underestimate students’ true level of proficiency and pose a threat to its valid assessment (Wise & DeMars, 2005). Resonating with Eccles & Wigfield’s EV theory (2002), particularly for low-stake tests, individuals invest their efforts on grounds of the perceived utility and costs of the task, resulting in better achievement (Eklöf, 2010; Rutkowski & Wild, 2015). Experimental studies on low- or high-stake assessments comparing different conditions (i.e., emphasizing personal or social accomplishment; feedback, grading, or monetary incentives) found that test consequences impact motivation and test performance (Rutkowski & Wild, 2015; Wise & DeMars, 2005, p. 6). Regarding the above-mentioned function of utility and costs, even though the quizzes do not count towards the final grade, they have various characteristics contributing to their utility due to the opportunity to receive meaningful and clear feedback on the current knowledge level for further exam preparation (Wise & DeMars, 2005). This relevance of the feedback for the final high-stake assessment distinguishes the quizzes from other low-stake assessments with no relevance for the actual course that run the risk of being answered without any effort. Regarding the perceived costs of test-taking, studies suggest that multiple choice or cloze formats, which were used in the quizzes as well, contribute to a lower perceived mental taxation compared to open questions or essays and thus further reduce potential hurdles to motivated participation (DeMars, 2000; Wise & DeMars, 2005).
The number of questions for each quiz (four to six) was also manageable and varied from easier to more difficult, but delimitable questions, so as not to hamper student effort (Wise & DeMars, 2005). The low-stake situation also minimizes the risk of cheating in the unsupervised environment as another source of construct-irrelevant behavior (Downing & Wise, 2004, p. 329) since students were not compelled to perform well. It thus can be assumed that the option to receive feedback along with the smaller scope of each quiz should compensate for the lack of a high-stakes context.

²⁰ The collected data also provides a first indication of the relevance of test motivation compared to mere skill. Correlation coefficients suggest that a higher quiz score is associated with willingness rather than with prior achievement (i.e., r < .2 for the relationship between quiz scores and final math grade in school, but r > .59 for the relationship between quiz score and the item “I made an effort to do as well as possible on the e-quiz.”). Standardized regression coefficients suggest that the latter variable is twice as meaningful for the quiz score as prior achievement-related knowledge.

Content evidence of the SATS-36

For most existing statistical attitude scales, content development and validation are only casually and vaguely addressed (Nolan et al., 2012, p. 108; Roberts & Bilderback, 1980; Wise, 1985). The development of the SATS-36 stands out against the other existing scales in that it has been sufficiently documented with regard to content validity (Emmioglu et al., 2018). Schau et al. (1995) used the nominal group technique including experts (four students and two statistics educators) to generate a list according to which attitudes towards statistics of undergraduate students²¹ were categorized, itemized, narrowed down to the items with the greatest consensus, and categorized into the known dimensions. The items were then piloted and refined according to reliability analyses and validated by means of construct and discriminant confirmatory factor analyses (Ramirez et al., 2012; Schau et al., 1995). The scale development, however, was not primarily driven by theory because the congruency of the dimensions with the EV model was postulated a posteriori (Hilton et al., 2004). Moreover, the two dimensions of “interest” and “effort” have been added in hindsight to increase congruency with the EV model (Schau, 2003), but the development and validation process of these indicators remains unclear (Nolan et al., 2012, p. 118; Schau, 2012, p. 60).

Content evidence of the AEQ

The construction of the AEQ was driven by “theory-evidence loops” and based on the CV theory of achievement emotions (Pekrun et al., 2010). Proceeding from exploratory studies on university students, the most prevalent and frequently reported emotions were selected from open questions using event sampling in the context of a high-stakes exam (Pekrun et al., 2004).
Several quantitative studies were then conducted to assess the frequencies, components, and relations of the previously gathered categories of emotions to advance scale development. With a variety of reported emotions, such as joy, relief, and hopelessness, the findings emphasize the multi-dimensional nature of emotions. In subsequent studies, a large item pool covering these different facets was constructed under consideration of the relatable statements from the interviews (Pekrun et al., 2004). Constructs were determined to ensure coverage of all theoretically relevant valence and activation categories and a sufficient degree of discrimination (Pekrun et al., 2011). Scale development, in turn, furthered the theory-based taxonomy, leading to the assumption of a multi-component structure of emotions, consisting of affective, motivational, physiological, and cognitive dimensions. Even though evidence based on the content validity of the inferences from the AEQ is limited, the constructs have been widely established in different countries, among different populations (elementary school, high school, university), and in subjects such as mathematics, psychology, physics, and foreign language learning (Bhansali & Sharma, 2019; Camacho-Morles et al., 2021; de la Fuente et al., 2020; Starkey-Perret et al., 2018), suggesting a good performance of the measurement instruments irrespective of the research context.

²¹ In the present study, the measurement instrument is thus used in a population that the instrument was mainly intended for, namely undergraduate students attending their first statistics course.

5.3.4 Validity Evidence Based on Relations with other Variables

Evidence based on the relationship to other variables is fulfilled if the scales relate to external measures in theoretically and empirically supportable ways (AERA et al., 2014). The first concept is that of construct validity. Construct validity indicates how well a measure reflects the underlying construct conceptually by linking the observed scores to another underlying model or theory. Nomological validity as part of construct validity is assumed if relationships between two or more constructs correspond to a theoretically justifiable nomological network (Weiber & Mühlhaus, 2014, p. 161). Convergent and discriminant validity examine the relationship between theoretically similar and different constructs, respectively (Kibble, 2017). The Standards also suggest the analysis of group memberships for the attestation of convergent validity if theory supports that group differences should be present²² (AERA et al., 2014). Moreover, test-criterion relationships refer to the accuracy with which test scores predict a simultaneously or subsequently measured criterion (Adamson & Prion, 2012; Bühner, 2021, p. 602; Weiber & Mühlhaus, 2014, p. 157). For instance, criterion validity is assumed in the presence of high correlations between the measured construct and an appropriate, external criterion, such as academic achievement.

²² Such differences in statistics education for attitudes and emotions have already been shown in the theoretical part and will be analyzed later with regard to the present study in the scope of the multiple group analysis.

Evidence based on other variables for the SATS

Convergent validity²³ was investigated in the early distribution phases of the SATS. A first study by Schau et al. (1995) yielded high correlations between SATS constructs and the “Statistics Attitudes Survey” total score (Roberts & Bilderback, 1980). Most studies analyzed convergent validity with regard to the “Attitudes Towards Statistics” questionnaire from Wise (1985), which had been the most used instrument in the statistics context up to that point. This questionnaire assesses attitudes towards the statistics course and the field of statistics with two separate constructs. Affect, difficulty, and self-efficacy were found to correlate positively with the course scale while the field scale correlated more strongly with the value construct (Chiesi & Primi, 2009; Nolan et al., 2012). Although the state of research on the convergent validation in the context of the SATS relies on few studies, they are consistent in that they indicate moderate to high correlations with other renowned statistics attitude scales. Test-criterion relationships for the SATS mostly rely on attitude-achievement relationships, whereby achievement is measured by means of course grades and examination scores (Nolan et al., 2012; Stanisavljevic et al., 2014). Regression analyses and structural equation models reveal that statistics-related attitudes may account for up to one fifth of the variance of academic achievement (Nolan et al., 2012). The significance and magnitude of the regression and correlation coefficients depended on the respective construct.
The strongest and most consistently significant relations were found between achievement, self-efficacy, and affect (Cashin & Elmore, 2005; Chiesi & Primi, 2009; Lavidas et al., 2020; Paul & Cunnington, 2017; Stanisavljevic et al., 2014), while coefficients for difficulty and value were often insignificant and lower compared to the other constructs (Finney & Schraw, 2003; Stanisavljevic et al., 2014; Tempelaar, van der Loeff, et al., 2007). Strikingly, in the studies summarized by Nolan et al. (2012), interest and effort were found to not predict academic achievement²⁴. Effort positively predicted achievement in other studies outside the scope of their literature review (Stanisavljevic et al., 2014). In most studies, the proximity of time between test administration and exam grade collection also increased the predictive validity of the scales (Chiesi & Primi, 2009, p. 310).

²³ For the SATS-36, studies on discriminant validity are scarce to non-existent (Nolan et al., 2012), so that this aspect will not be considered in this chapter. In a meta-analytic review on validation studies of the SATS-36, Nolan et al. (2012) construe low to moderate correlations of the SATS-36 with a measure of attitudes towards mathematics found in Nasser (2004) to be evidence of discriminant validity. However, since mathematics attitudes have been shown to contribute to statistics attitudes for beginning students, this conclusion might be rather premature. This is emphasized by the correlations of cognitive competence and affect with math attitudes (.49/.47; Nasser, 2004), which are too high to serve as evidence for discriminant validity.
²⁴ Tempelaar and Schim van der Loeff (2011) infer that the low relation of effort with achievement might stem from the fact that the conceptualization of the scale resonates with surface learning strategies, which will be re-addressed in Section 10.1.2.

Evidence based on other variables for the AEQ

There seems to be no study investigating the convergent and discriminant validity of achievement emotion relations with other variables. Moreover, systematic validation studies are mostly restricted to the beginning phase of scale distribution (Pekrun et al., 2011), provide only correlational evidence for external validity, and are based on small sample sizes (< 20). No study was found providing a summary of existing validation studies. However, the AEQ has a strong theoretical grounding in the CV theory, whereby achievement emotions measured by means of the AEQ yielded adequate predictive power in explaining students’ achievement and CV appraisals. In several studies, self-efficacy, interest, and effort were found to correlate positively with joy and negatively with hopelessness, stemming from their respective activating and deactivating nature (Gómez et al., 2020; Jacob et al., 2019; Peixoto et al., 2015). In accordance with CV theory, positive activating emotions, such as enjoyment, positively correlated with academic achievement, and vice versa for negative deactivating emotions, such as hopelessness (Frenzel, Pekrun, et al., 2007; Gómez et al., 2020; Peixoto et al., 2015; Pekrun et al., 2011). Pekrun et al. (2011) found that the achievement-related correlations were stronger for the learning-related emotions than for course emotions. In a meta-analysis entailing 68 studies with a sample of n = 30,000 participants, Camacho-Morles et al. (2021) found a positive, moderate meta-analytic correlation for enjoyment and academic achievement while the correlation for hopelessness was close to zero.
The relationships illustrated in this subchapter suggest a substantive nomological network for both SATS and AEQ and an appropriate foundation to investigate reciprocal appraisal-achievement relationships. These will be readdressed and investigated during the structural modeling of the present data. The recurrence of the theoretically assumed relations in the collected data along with a good fit of the causal models would then be an indication of the construct validity of the study (Weiber & Mühlhaus, 2014, p. 161).
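
The test-criterion evidence summarized in this subchapter rests on correlation and regression coefficients. As a rough illustration, and not as a reproduction of any of the cited analyses, the following Python sketch computes a Pearson correlation and the corresponding share of explained variance for a single predictor, in which case R² equals r². The function names and the data are hypothetical. Under this single-predictor simplification, attitudes accounting for up to one fifth of the achievement variance would correspond to a correlation of roughly .45 in absolute value.

```python
import math

import numpy as np


def pearson_r(x, y):
    """Pearson product-moment correlation between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()   # center both variables
    return float(xc @ yc / math.sqrt((xc @ xc) * (yc @ yc)))


def variance_explained(x, y):
    """Share of variance in y explained by the single predictor x (r squared)."""
    return pearson_r(x, y) ** 2


# Invented example values (assumption): attitude scale means and exam points.
attitude = [3.5, 4.0, 5.5, 2.0, 6.0, 4.5]
exam = [48, 70, 66, 41, 80, 52]

r = pearson_r(attitude, exam)
print(f"r = {r:.2f}, explained variance = {variance_explained(attitude, exam):.2f}")

# Correlation implied by an explained variance of one fifth (single predictor).
print(f"|r| for R^2 = .20: {math.sqrt(0.20):.2f}")
```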

5.3.5 Further Relevant Validity Criteria

Evidence based on response processes

Evidence based on response processes is meant to ensure that the cognitive processes or behavior during the assessment reflect the constructs the instrument is intended to assess (Knekta et al., 2019; Kibble, 2017). The correspondence between the response processes and the alleged constructs can usually only be tested by means of think-aloud interviews, which have not yet been conducted in the context of this study (see Section 10.2.2). As mentioned in Section 5.2.2, measures were particularized to learning with concrete statistical subject matter and assessed in students’ immediate learning contexts while attending the course and while studying outside the course (i.e., at home; Putwain, Larkin, et al., 2013). The ecologically valid environment, the well-situated and context-specific instructions to each category of questions, as well as the integration of assessment and course framework (see Section 5.2.3) were meant to encourage the evocation of the statistics-related response processes consistent with the scale developers’ intentions (Downing, 2003, p. 834; Kibble, 2017, p. 114; Knekta et al., 2019). From a methodological perspective, response process evidence also relates to the accuracy of the data input (Downing, 2003, p. 834; Kibble, 2017, p. 113). In other words, sources of human error associated with the workflow of the observers’ records from data collection to data input and evaluation (i.e., computation of composite scores and their usefulness) have to be minimized to the greatest extent possible²⁵. Moreover, the questionnaires were created based on the LaTeX syntax, so that the paper-based, filled-out surveys were scanned and imported almost fully automatically by means of the optical mark recognition program SDAPS (Berg, 2021). The automated procedure reduces the risk of human error given the high amount of data in the longitudinal framework.
Ambiguous responses (i.e., illegible crosses, crosses outside or in-between checkboxes, double crosses) were filtered by the software for individual inspection. The raw data was then imported into Excel. In Excel, some categorical and multi-digit variables needed to be processed further by means of standardized formulae to yield a compact and interpretable dataset. The integrity of this conversion process was confirmed by checking the correspondence between 50 randomly selected questionnaires and the converted data as well as by considering the permissible value range of all entered variables. Using descriptive procedures, cases with tendentious response behavior were identified and, depending on the number of implausible values, were retained or recoded into missing values. Recoding of inverted items, missing data, and indexing was conducted via automated syntax commands. These mostly automated procedures ensured a consistent processing of the large dataset and thus the comparability of all interpretations with the original scales.

²⁵ Validity on the accuracy of the data also entails securing a good functioning of each test item (Downing, 2003, p. 834). As mentioned in Section 5.2.2, the items had been pretested to eliminate those items based on their reliability, redundancy, and content-specific appropriateness.

Evidence based on the consequences of testing

The consequences of testing refer to the degree to which the practical need of the study is balanced against potential liabilities on part of the participants (Knekta et al., 2019). Whereas this aspect of validity has a higher weight for placement assessments, such as in the United States (Downing, 2003, p. 836; Kibble, 2017, p. 116), the purpose of this study is to evaluate students’ state of mind in classical and blended statistics courses. Since this evidence is predominantly used to determine course and institutional improvements of higher education teaching, the unintended consequences on part of the individual participants are deemed negligible. As regards additional effort for participation, the in-course surveys, quizzes, and exams were part of the regular course schedule. Only the participation in the two surveys outside the course at t3 and t7 required spending 15 minutes of free time, which is deemed tolerable considering the iPads that were raffled off. The conformity of the assessment framework with the General Data Protection Regulation (GDPR) had been coordinated with the data protection officer of the university: Each survey introduced the purpose of the study in understandable language while also emphasizing the voluntariness of the participation. Students were given the choice during the first and second survey to actively opt into the assessment by giving their consent to process their data and their user IDs to match the different survey data sets throughout the semester²⁶.
A procedure directory²⁷ was drawn up to limit the usage of the data to specific, responsible project assistants and to the specific purpose of teaching evaluation. The consequences of testing also entail the question of whether the formative assessment by means of quiz feedback has positive consequences regarding students’ attitudes and future performance, i.e., whether the intended consequence is attained (Adamson & Prion, 2012, p. 384; Kibble, 2017, p. 116). This question will be readdressed after the analysis of the structural models.

26 Art. 6a GDPR “Lawfulness of processing”.
27 Art. 30 GDPR “Records of processing activities”.

5.3.6 Implications From the Reliability and Validity Evidence for the further Analyses

The above-mentioned validation studies suggest that the SATS-36 and AEQ are widely established measurement instruments with adequate psychometric quality. However, some issues that arose in checking the validation studies need to be revisited or factored into the subsequent analyses. First, the instruments must be checked for their data fit and their dimensional structure, as the literature review has shown that some original constructs need modifications. For the SATS, attention has to be devoted to higher between-factor correlations, i.e., between Affect and Cognitive Competence. Moreover, the prevalent item parceling approach occasionally obscured malfunctioning items, such as for Difficulty and Value. This is why the factor structure and individual items will be double-checked by means of confirmatory and exploratory factor analyses to see whether there is a common or hierarchical structure associated with some of the constructs, or whether individual items have cross-loadings on several constructs. The AEQ was validated while testing several hierarchical models according to the component- and context-specific structures. For the present study, most of these structures are not applicable (see Section 5.3.2), so that, for reasons of completeness, the separate emotional models will at least be tested against a common factor in which the separate emotions are nested within the course- and learning-specific assessment context. Concerning the quiz and exam scores, both adequately and evenly represent what students learn in the course throughout the semester. The low-stakes context of the quiz is deemed negligible but will be revisited when it comes to the analyses of relations between prior knowledge, effort, and achievement. To sum up, the Standards state that “validation is the joint responsibility of the developer and user” (AERA et al., 2014, p. 13), entailing a twofold implication for the further analyses. First, the above-mentioned issues have to be considered and, ideally, addressed in the modelling process and, secondly, the replicability of the findings (i.e., predictive relationships, group-specific differences) has to be checked with the sample drawn in the context of this project.

5.4 Samples

5.4.1 Participants

The above-elaborated validity criteria and the hypothesized models from Section 4.5 will be evaluated by means of two cohorts in 2017 and 2018 according to the two course setups described in Section 5.2.1. Both cohorts consist of different students, except for those who failed the 2017 course (N = 81) or did not attend the final exam (N = 64) and who participated again in 2018.²⁸ In two large simulation studies, Savalei (2010, p. 364) found that, even when using robust estimators, large sample sizes (N > 500) are important to fit larger models to incomplete and not perfectly normal data (see Sections 5.4.2 and 5.4.3). Moreover, a large sample can compensate for convergence problems resulting from a higher amount of missing data (Lei & Shiverdecker, 2019, p. 15). To assess whether both datasets can be merged to obtain a larger sample size, the course entry criteria (such as gender, cultural background, prior knowledge) and course outcomes (quiz and exam scores) of the traditional lecture (cohort 1) and the flipped course (cohort 2) are evaluated with respect to their comparability. The other constructs relevant for the research questions (i.e., motivational and emotional course outcomes) will be reanalyzed later in the context of multiple group analysis to investigate whether the course design of 2017 or 2018 impacts students’ cognitive-motivational development differently throughout the semester. Table 5.6 provides a mean comparison of both samples. The differences between the traditional lecture sample (S1) and the flipped course sample (S2) were mostly non-significant, with three exceptions. First of all, students enter the traditional course with a slightly better final school grade, which may be due to a relaxation of the admission requirements compared to 2017. This entry difference also seems to translate into worse final exam scores for the flipped course, in that students scored on average 2.88 points out of 100 lower in the final exam. Students from the traditional course performed slightly worse in the first quiz, while these differences do not persist in all subsequent

28 The 145 students who participated twice may have restricted the independence of both samples. This circumstance is accepted because removing the repeaters from the second sample would impair the sample comparability. For instance, assuming that repeaters have certain affective and cognitive characteristics leading to dropout, their omission would likely render the second sample artificially different from the first sample, in which repeaters from prior semesters are still included (e.g., summer term 2016). Therefore, repeaters were kept in the sample.
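The effect sizes in Table 5.6 can be approximated from the reported means and standard deviations. The following minimal Python sketch computes Cohen’s d with a pooled standard deviation. Because the group sizes per variable are not restated here, it assumes equal group sizes by default; this is an assumption of the sketch and not part of the study, and the function name cohens_d is hypothetical.

```python
import math


def cohens_d(mean_1, sd_1, mean_2, sd_2, n_1=None, n_2=None):
    """Standardized mean difference based on the pooled standard deviation.

    If no group sizes are passed, equal group sizes are assumed
    (an assumption of this sketch, not taken from the study).
    """
    if n_1 is None or n_2 is None:
        pooled_var = (sd_1 ** 2 + sd_2 ** 2) / 2
    else:
        pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


# Final exam row of Table 5.6: traditional course vs. flipped course
d_exam = cohens_d(0.609, 0.173, 0.581, 0.189)
print(round(d_exam, 3))
```

For the final exam, the sketch yields a value of roughly .155, close to the reported .156; the remaining gap plausibly stems from rounding of the tabled values and from unequal group sizes.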


Table 5.6 Mean Comparison and their Effect Sizes for Both Samples

variable                                  traditional course  flipped course      mean difference
                                          mean (SD)           mean (SD)           [Cohen’s d]

1 COGNITIVE ATTAINMENT
1.1 Summative achievement (administrative data)
final exam                                .609 (.173)         .581 (.189)         .028c [.156]
1.2 Formative achievement (administrative data)
quiz 1                                    .554 (.210)         .595 (.212)         –.041c [–.193]
quiz 2                                    .599 (.206)         .591 (.224)         .008 [–.036]
quiz 3                                    .523 (.298)         .543 (.306)         –.020 [–.064]
quiz 4                                    .510 (.261)         .501 (.274)         .009 [–.033]
1.3 Prior school achievement
final school grade                        2.353 (.574)        2.416 (.521)        –.063b [–.115]
math school grade                         2.493 (.967)        2.559 (.952)        –.066 [–.068]
taken advanced math course                .370 (.483)         .329 (.470)         .041 [.086]

2 SOCIODEMOGRAPHIC CHARACTERISTICS
male                                      .536 (.499)         .521 (.500)         .015 [.030]
at least one parent not born in Germany   .345 (.476)         .341 (.474)         .005 [.010]
first language German                     .814 (.389)         .799 (.401)         .015 [.039]

Notes. SD = standard deviation; a p […]

[…] 0), so that there is a small range of different answers. Pekrun et al. (2010) also found the hopelessness scales, among all emotional scales, to be positively skewed in their validation study. Their explanation is the “extreme nature” and “rare occurrence” of this emotion in achievement settings. Hence, students tend to deny that they feel hopeless while learning statistics in and out of class. The indicators hL 1 and hC 1, which refer to “frustrating” situations, do not have a pronounced floor effect, so that frustration seems to be a more relatable, moderate emotional manifestation compared to hopelessness. Beyond that, jL 2, jL 3, jL 8, jC 2, and jC 3 have a moderate to strong ceiling effect, while jL 22 is additionally leptokurtic and skewed to the left. These deviations likely stem from the item wording (“Reflecting on my progress in statistics course-work makes me happy.”). This item does not primarily assess an emotion appraisal, but an appraisal of a general causal link between progress and happiness, which is unlikely to be denied. The performance of this item should be reconsidered in the later model optimizations. The quiz score distribution is also slightly skewed to the left, indicating that more students have a higher score, but it comes closer to a zero skew at the end of the semester. The distribution of quiz scores 3 and 4 becomes increasingly platykurtic (kurtosis < 0), indicating a broader performance range. This might be because the quiz was low-stakes, with scores not counting towards the exam, leading to more heterogeneous distributions. A potential countermeasure to avoid scores at the lower end would be the implementation of passing limits or bonus points. These measures, however, may also have unfavorable implications for the assessments, which were discussed in Section 3.1.2. In all, most of the variables are moderately skewed and kurtotic with occasional outliers, so that moderate non-normality can be assumed for these constructs. Some indicators of the hopelessness constructs deviate more strongly from a normal distribution, while the large sample size of above 200 participants should compensate for these skewed indicators according to the Central Limit Theorem (Lei & Shiverdecker, 2019, p. 15). Moreover, the presence of univariate normality does not preclude multivariate non-normality, which would violate the assumptions of the usual FIML estimation. To avoid wrongly assuming multivariate normality and to account for potential violations of normality due to the original ordered categorical scaling of most of the variables, it therefore seems more advisable to preventively assume (multivariate) non-normality of the data. Based on the 7-point Likert scale, the variables could be treated as continuous (Babakus et al., 1987; Mesly, 2015). However, ordinal measurements retain a discretized, “crude nature” that will “induce some level of non-normality to the data” (Finney & DiStefano, 2003, p. 274; Li, 2015, p. 938; Suh, 2015, p. 568). Even though the variables are approximately normally distributed, this may not be sufficient for obtaining an optimal SEM solution. Bearing in mind the findings of this chapter, the appropriate Mplus estimator has to be selected as a basis for the performance analyses.
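The distributional labels used in this section (skewed to the left or right, leptokurtic, platykurtic) can be illustrated with a short Python sketch. It is purely illustrative: it uses moment-based skewness and excess kurtosis without small-sample corrections, the response data are invented to mimic a floor effect on a 7-point scale, and the threshold of 2 for a still moderate skew is treated as an assumed rule of thumb. The function names are hypothetical.

```python
def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


def describe_shape(values, skew_limit=2.0):
    """Translate skewness and excess kurtosis into verbal labels."""
    skew, kurt = skew_and_excess_kurtosis(values)
    if skew > 0:
        direction = "right-skewed"   # bulk of answers at the lower end (floor effect)
    elif skew < 0:
        direction = "left-skewed"    # bulk of answers at the upper end (ceiling effect)
    else:
        direction = "symmetric"
    if kurt > 0:
        tails = "leptokurtic"
    elif kurt < 0:
        tails = "platykurtic"
    else:
        tails = "mesokurtic"
    return {
        "skew": skew,
        "excess_kurtosis": kurt,
        "direction": direction,
        "tails": tails,
        "moderate_skew": abs(skew) < skew_limit,  # assumed rule of thumb
    }


# Invented responses to a hopelessness-type item (1 = strongly disagree)
hopelessness_item = [1] * 60 + [2] * 20 + [3] * 10 + [4] * 5 + [5] * 3 + [6] * 1 + [7] * 1
print(describe_shape(hopelessness_item))
```

With these invented data, most answers pile up at the lowest category, so the sketch labels the item as right-skewed and leptokurtic, which mirrors the floor effect described for the hopelessness indicators.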

6 Evaluation of the Unmodified Measurement Models

6.1 Choice of an Appropriate Estimator

Combining the insights from section 5.4, the estimator for the present study should adequately handle partially non-normal, incomplete data¹ to avoid biased standard errors despite the large sample. Apart from these considerations, a deciding factor for the appropriate estimator in Mplus lies in the scaling of the indicators. As mentioned above, the ordinal Likert indicators can be treated as continuous due to the seven response options. Since the indicators are still ordinal by definition, the diagonally weighted least squares estimator (WLSMV), another popular frequentist approach specifically designed for ordinal data, has to be considered as an option (Enders & Baraldi, 2018, p. 153; Li, 2015, p. 937; Pritikin, Brick, & Neale, 2018, p. 490). The WLSMV procedure employs probit regressions to estimate polychoric or tetrachoric correlations based on the bivariate frequencies for pairwise available cases (Enders & Baraldi, 2018, p. 158). Various studies have shown that the WLSMV produced accurate factor loadings with ordinal data and had better convergence in most conditions, whereas ML underestimated parameter estimates under various sample size and distributional conditions (Li, 2015, p. 946) or only

1 Another addition to the estimation would be to draw on expected versus observed information matrices (INFORMATION = EXPECTED in Mplus). Since the missing data mechanism is likely not MCAR for the datasets, both information matrices are not asymptotically equal, and the expected information standard errors would not be consistent (Savalei, 2010), so that only the observed ones are considered in the present study.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-41620-1_6.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_6


yielded negligible differences (Suh, 2015, p. 579); these studies were, however, conducted with complete datasets (Lei & Shiverdecker, 2019, p. 4). Therefore, a major drawback of the WLSMV as a limited-information estimator in the context of this study is that it assumes MCAR along with an insubstantial amount of missing data. Hence, its accuracy and efficiency are more seriously affected by missingness than those of full-information estimators due to the pairwise deletion procedure (Asparouhov & Muthén, 2012; Chen et al., 2019, p. 99; Enders, 2010, p. 41; Lei & Shiverdecker, 2019, p. 15; Liu & Sriutaisuk, 2019, p. 564).² More concretely, type I error tended to be inflated and standard errors were more variable when the latent distribution deviated from normality (Chen et al., 2019, p. 99; Lei & Shiverdecker, 2019, p. 4; Suh, 2015, p. 568). Based on the analyses in section 5.4.2, the large missing ratio of about 25% and the untenable MCAR assumption do not justify the usage of the WLSMV (Chen et al., 2019, p. 99; Shi et al., 2019). Alternatively, missing data could be imputed manually beforehand to obtain multivariate summaries that can be analyzed with the WLSMV (Enders & Baraldi, 2018, p. 153; Pritikin, 2018, p. 491). This technique, however, is not well applicable considering the higher amount of missing data and the implausibility of imputing data of dropped-out participants (see section 5.4.2). Comparing the relative performance of WLSMV with ML(R) in a Monte Carlo simulation, Lei and Shiverdecker (2019, p. 1) found that ML(R) with explicitly specified categorical variables yielded the most accurate estimates and was least affected by missingness as long as the sample size was >200. This specified estimator accounts for non-linear relationships between the discrete indicators and the respective latent variables (Lei & Shiverdecker, 2019, p. 14). The procedure, however, requires numerical integration, which is computationally too demanding in the present study due to the high number of latent variables and exponentially high number of integration points (>4; Asparouhov & Muthén, 2012; Lei & Shiverdecker, 2019, p. 14). Nevertheless, the ordinary ML(R) for continuously treated variables was also found to be mostly unaffected by missing data, and factor loadings showed no bias for only moderately skewed³ indicators with at least five categories (Chen et al., 2019, p. 99; Lei & Shiverdecker, 2019, p. 15).

2 For purposes of testing, some structural models were compared with each other using the MLR and WLSMV. Strikingly, inconsistencies in terms of significance only occurred for paths where the estimation is based on data assessed after half and near the end of the semester, which could be caused by the increasing panel mortality.
3 Skew approximately smaller than 2 in absolute values.


Therefore, the focus shifts back to the traditional FIML estimation, whereby additional auxiliary variables will be used in the present study to attenuate potentially existing MAR missing patterns (see section 5.4.2). The corresponding ML in Mplus estimates polychoric correlations from a generalized linear model and uses all available response patterns to estimate the model parameters (Lei & Shiverdecker, 2019, p. 4; Suh, 2015, p. 596). Traditional maximum likelihood estimation and the likelihood ratio statistic, however, rest on the assumption of multivariate normality to estimate standard errors and model fit (Maydeu-Olivares, 2017). With non-normal data, the log-likelihood ratio test deviates from the χ² distribution, so that the χ² value becomes inflated and standard errors too small (Finney & DiStefano, 2003, p. 274; Lei & Wu, 2012, p. 176; Li, 2015, p. 937). This increases the probability of erroneously rejecting the null hypothesis for path coefficients (type I error), implying that estimates with an actual zero effect would be deemed significant in the model. For multivariate non-normal data, sandwich estimators are recommended to obtain consistent standard errors (Savalei, 2010). MLR is the only maximum likelihood estimator with robust standard errors to handle incomplete non-normal data (Savalei, 2010) and is thus considered suitable given the incomplete and partially non-normal data of the longitudinal framework (see section 5.4.3). The MLR is asymptotically equivalent to the mean-adjusted Yuan-Bentler T2* test statistic (Wang & Wang, 2012, p. 61) and adjusts standard errors and χ² test statistics for robustness against non-normality (Li, 2015, p. 938). These adjustments are based on an estimate of multivariate kurtosis. The adjustment is referred to as the scaling correction factor and represents the ratio of the original χ² to the adjusted χ² (Hoyle, 2011, p. 58).

In extensive simulation studies under different model conditions (model size, model misspecification) and sample size conditions (200–1000), different estimation techniques were compared (Finney & DiStefano, 2003; Maydeu-Olivares, 2017; Savalei, 2010). MLR yielded robust standard errors in all conditions, except for smaller sample sizes (N = 200–500), smaller models, and extreme deviations from normality. Contrasting these findings with the present study, the minor deviations from normality, the sufficient sample size (greater than 500), and the high number of parameters in the later structural models should contribute to the robustness of standard errors. The conventional ML estimator is robust to minor deviations from normality (Brown, 2006, p. 379). However, as mentioned above, the normality assumption may be violated when the indicators are ordinal by definition and have few response categories (Li, 2015, p. 378). As shown in section 5.4.3, descriptive statistics suggest that most indicators deviate slightly from the normal distribution and a few indicators more heavily. Hence, it seems safer to use the MLR because it provides accurate, robust standard errors and fewer type I errors regardless of the magnitude of non-normality in conjunction with incomplete data (Maydeu-Olivares, 2017; Savalei, 2010). Following Finney and DiStefano (2003), both the standard errors and the χ² differences will be compared for ML and MLR, and scaling factors⁴ will be considered in the later analyses to evaluate whether the assumption of multivariate normality might be violated.
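The role of the scaling correction factor can be made concrete with a small Python sketch. The first function simply restates the ratio described above. The second function applies the commonly documented scaled χ² difference procedure for robust estimators (Satorra-Bentler type), which is not spelled out in this chapter and is included here as an assumption about how χ² differences between nested MLR models would be tested. All function names and numerical values are hypothetical.

```python
def scaling_correction_factor(chi2_ml, chi2_mlr):
    """Ratio of the uncorrected ML chi-square to the robust (MLR) chi-square."""
    return chi2_ml / chi2_mlr


def scaled_chi2_difference(chi2_nested, df_nested, c_nested,
                           chi2_general, df_general, c_general):
    """Scaled chi-square difference for two nested models estimated with MLR.

    chi2_* are the robust chi-square values, c_* the scaling correction
    factors reported alongside them. Returns (test statistic, df difference).
    """
    df_diff = df_nested - df_general
    # Scaling correction that applies to the difference test itself
    cd = (df_nested * c_nested - df_general * c_general) / df_diff
    # Multiplying by c_* converts each robust chi-square back to the ML metric
    stat = (chi2_nested * c_nested - chi2_general * c_general) / cd
    return stat, df_diff


# Invented values: a restricted (nested) model vs. a less restricted model
stat, df_diff = scaled_chi2_difference(120.0, 50, 1.20, 100.0, 45, 1.25)
print(round(stat, 2), df_diff)
```

A scaling correction factor close to 1.0 means that the ML and MLR χ² values nearly coincide, whereas larger values indicate a stronger correction. The resulting difference statistic would be referred to a χ² distribution with df_diff degrees of freedom.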

6.2 Specification of the Factor-indicator Effect Direction

To ensure a sufficiently precise estimation of the assumed relationships in the later structural models with the MLR estimator, the measurement models need to fulfil common psychometric standards. Since the goal of this study is to contribute a deeper understanding of cognitive, motivational, and emotional feedback processes, a high number of latent constructs, parameter estimates, and relations must be accounted for. To avoid misspecifications or inappropriate model properties in the later causal models, particularly since further invariance restrictions are needed for most of the following longitudinal analyses, the model evaluation will be brought forward to ensure the appropriateness of all measurement models from the bottom up. Therefore, in a first step, the measurement models will be specified and optimized separately in such a way that they yield acceptable fit values at all measurement occasions in the scope of configural invariance, implying a constant number of constructs and pattern of loadings over time. An important basis for the choice of evaluation criteria lies in the specification of the constructs. Following Bollen’s (1989) appeal to make a considered choice between cause and effect indicators, the factor-indicator effect direction must be determined before analyzing the measurement models. Formative measurement models usually consist of a fixed set of causal indicators, where each indicator determines a specific facet of the respective theoretical construct (Weiber & Mühlhaus, 2014, p. 256). Regarding the interchangeability of indicators, exchanging one causal indicator would change the semantic definition of the latent construct, but may have no effect on the other indicators (Diamantopoulos & Winklhofer, 2001; Weiber & Mühlhaus, 2014, p. 256). Therefore, the indicators generally have lower correlations and are not interchangeable as they only represent the meaning of the construct appropriately in their entirety.

4 If the scaling correction factor is 1.0, the data is assumed to be multivariate normally distributed, and parameter estimates are equal to ML. A higher scaling factor represents a higher multivariate kurtosis (Hoyle, 2011, p. 58).


For reflective measurement models, a variation of the latent construct causally impacts the observable manifestations in the underlying effect indicators, so that the effect direction is reversed and runs from the construct to the indicator (Eberl, 2004, p. 3; Geiser, 2010, p. 41). Jarvis et al. (2003) offer decision rules to further determine the factor-indicator effect direction, a selection of which is shown in Table 6.1.

Table 6.1 Decision Rules for Factor-Indicator Effect Directions

1. Direction of causality
Decision rule:
• From construct to item (R).
• Indicators are manifestations of the construct (R).
• Changes in the construct cause changes in the indicators (R).
Application to the present study:
• Variation of the latent construct is not assumed to be induced by formative indicators, but by quiz scores. Influence from the quiz score to the construct then translates into the observable manifestations of the respective behavior inside and outside the course (causal priority from construct to indicator).
• Indicators reflect observable causes of the reflected construct (e.g., if students are self-efficacious, they evaluate themselves to be able to understand statistics formulae).

2. Interchangeability of the indicators
Decision rule:
• Indicators are interchangeable (R).
• Indicators share a common theme (R).
Application to the present study:
• E.g., omission/exchange of “I plan to study hard for the statistics exam” is not expected to change the overall statistics-related effort construct. The construct can also be represented by another exemplary manifestation related to the workload in statistics if it has comparable reliability values.

3. Covariation among the indicators
Decision rule:
• Indicators are expected to covary with each other (R).
• A change in one of the indicators is associated with a change in the other indicators (R).
Application to the present study:
• Covariance among the indicators is assumed to be caused by the latent variable, e.g., variation in self-efficacy causes variation in all the observable appraisal manifestations.

4. Nomological net of the construct indicators
Decision rule:
• Nomological net should not differ (R).
Application to the present study:
• Indicators of each motivational and emotional construct overlap to measure the same respective construct; they measure different aspects of the same concept (e.g., enjoyment to go to statistics class & enjoyment to listen to the professor during class).

Notes. R = Criterion suggests the reflective nature of the constructs.

The interpretation of these decision rules leads to the assumption that the measurement models in this study are reflective. In most psychological studies, attitude and personality scales are construed as effect indicators (reflective measurement models) since the latent, underlying trait evokes these measurable facets of that behavior (Fornell & Bookstein, 1982). For instance, the statistics-related effort is an attitudinal construct determining the indicators that operationalize the construct, such as “I plan to work hard in my statistics course.” Such indicators are presumed to be driven by one unidimensional cause that is measured by several items (Weiber & Mühlhaus, 2014, p. 109). Following this decision, the measurement models for each item-construct association will be depicted as a basis for the subsequent evaluation of the measurement performance.
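The difference between the two effect directions can be summarized in a compact formal sketch. The notation is an assumption introduced here for illustration and does not stem from the cited sources: x_i denotes indicator i, η the latent construct, λ_i a factor loading, γ_i a formative weight, and ε_i and ζ residual terms.

```latex
% Reflective (effect-indicator) model: the construct causes its indicators
x_i = \lambda_i \, \eta + \varepsilon_i , \qquad i = 1, \dots, k

% Formative (causal-indicator) model: the indicators jointly define the construct
\eta = \sum_{i=1}^{k} \gamma_i \, x_i + \zeta

% Implication of the reflective model with uncorrelated residuals (i \neq j):
\operatorname{Cov}(x_i, x_j) = \lambda_i \, \lambda_j \, \operatorname{Var}(\eta)
```

Under the reflective specification, any covariation among indicators is thus traced back to the common construct, which is why reflective indicators are expected to correlate and to be interchangeable, whereas the formative specification imposes no such expectation.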

6.3 Implications of the Construct Specification for the Model Evaluation

Reflective measurement models are operationalized following the concept of multiple items. For each reflective measurement model, reference indicators are selected to provide the latent construct with a metric (fixed marker scaling) and to identify the latent mean structure (Geiser, 2021, p. 118).⁵ Mplus automatically fixes the loading of the first indicator of the construct at one. Ideally, the reference indicator should adequately represent the construct, which is why indicators with consistently high loadings at all measurement occasions are selected under consideration of theory-conformity. The homogeneity of indicator loadings also ensures that the item does not bias the tests for model invariance on grounds of its potential non-invariance, which would be masked due to its standardization (see section 8.1). The loading of these reference indicators must therefore be fixed at one manually while freeing the loading of each first indicator. Only correlative relationships are allowed between factors since causal paths will be analyzed after model optimization. A second-order factor including all measurement occasions per construct will not be modelled because the aim of the study is to determine reciprocal effects and the distinct contributions of each emotional and motivational facet according to EV and CV theory to achievement and vice versa.⁶ Enjoyment and hopelessness will not be modelled as a general factor for the reasons illustrated in section 5.3.2. Rather, two models were separately tested for class- and learning-related emotions (as in Davari et al., 2020; Peixoto et al., 2015 etc.). The standardized factor loadings in Mplus, referring to the correlation between the observed values and the factor to which they are assigned, provide an indication of indicators that do not sufficiently relate to the hypothetical construct (Geiser, 2010, p. 63). The approach of an unparcelled CFA was selected to address the aforementioned limitations of earlier research while determining concrete potential for creating strong unidimensional measurement models (Landis et al., 2008, p. 210) by eliminating problematic indicators. Each removal will be reflected on with regard to the specific content of the item, whereas in most previous studies, such weak spots had been glossed over by means of parceling (see section 5.3.2). The appropriateness of each standardized coefficient and the reasons for low values will be dealt with when analyzing the indicator reliabilities in the following chapter. In contrast to formative models, reflective indicators should have higher correlations as they reflect consequential behavior of the latent construct (Eberl, 2004). Following the assumption of homogeneity, the indicators represent independent measurements of the theoretical concept, so that any selection of indicators randomly measures its nomological concept. The indicators are therefore interchangeable, exemplary manifestations of this theoretical concept (Weiber & Mühlhaus, 2014, p. 108). The omission of inappropriate items might therefore lead to an increase of the measurement quality of the construct and should not affect its overall integrity (Diamantopoulos & Winklhofer, 2001, p. 271). This is why, in the following chapters, the global goodness-of-fit of the motivational and emotional measurement models will be evaluated as a starting point to deduce further need for optimization by deleting single indicators.

5 An alternative way to identify latent variables is to fix the factor variance and mean while estimating all factor loadings freely (fixed factor method; Geiser et al., 2015). This approach is not always recommended for longitudinal studies due to the assumption of equal factor variances across time and will therefore not be pursued in this study. The different scaling approaches result in statistically equivalent models with identical overall model fit.
6 To check the state or trait character, second-order models for each dimension across time had been modelled, but they had a significantly worse fit than the models where each measurement occasion was represented by means of a single construct.
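How the removal of a weakly loading indicator can affect the measurement quality of a reflective construct may be sketched in Python. The sketch assumes the common formulas for indicator reliability (squared standardized loading), average variance extracted, and factor (composite) reliability with uncorrelated residuals; the loadings are invented and the function names are hypothetical.

```python
def indicator_reliabilities(loadings):
    """Squared standardized loadings: share of indicator variance explained by the factor."""
    return [loading ** 2 for loading in loadings]


def average_variance_extracted(loadings):
    """Mean indicator reliability across all indicators of one construct."""
    reliabilities = indicator_reliabilities(loadings)
    return sum(reliabilities) / len(reliabilities)


def factor_reliability(loadings):
    """Composite reliability, assuming standardized loadings and uncorrelated residuals."""
    squared_sum = sum(loadings) ** 2
    residual_variance = sum(1 - loading ** 2 for loading in loadings)
    return squared_sum / (squared_sum + residual_variance)


# Invented standardized loadings for a four-indicator construct
all_items = [0.80, 0.75, 0.70, 0.40]
without_weak_item = all_items[:3]  # drop the weakly loading fourth indicator

for label, loadings in (("all items", all_items), ("reduced", without_weak_item)):
    print(label,
          round(average_variance_extracted(loadings), 3),
          round(factor_reliability(loadings), 3))
```

In this invented case, dropping the weak indicator raises both the average variance extracted and the factor reliability, which illustrates why the omission of an inappropriate reflective item can improve measurement quality without changing the meaning of the construct.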

6.4 Global Goodness-of-fit of the Unmodified Measurement Models

A first reference point for optimization is the plausibility of the parameter estimates (Weiber & Mühlhaus, 2014). For the joint evaluation of the reliability, validity, and global fit of the postulated reflective measurement models, second-generation criteria within the scope of confirmatory factor analysis, which take account of measurement errors, should be consulted (Weiber & Mühlhaus, 2014, p. 199). Model optimization is based on less restrictive latent-state measurement models since, on the one hand, they allow for a straightforward detection of causes for model misfit from the ground up. On the other hand, the modified measurement models can subsequently be extended easily to the more specific structural models according to the research questions of the present study by means of reparameterization. Reflective models are typically evaluated on grounds of their indicator and factor reliability, average variance extracted, and confirmatory goodness-of-fit criteria (Weiber & Mühlhaus, 2014, p. 130). The latter indicate to what extent the theoretically postulated variance-covariance matrix differs from the empirical one (Geiser, 2010, p. 60). The aim lies in finding the best possible model, which predicts the empirical variance-covariance matrix accurately with the highest possible parsimony (Weiber & Mühlhaus, 2014, p. 201). Table 6.2 summarizes the consulted fit indices along with a short description and the assumed cut-off values. There is no unanimously agreed-on cutoff for these criteria, so that the recommendations are rather rules of thumb. Hu and Bentler (1999, p. 28) further argue that for large samples with n > 250, as in this study, recourse to TLI, CFI, and SRMR is recommended to minimize type I and type II errors. The following analyses also include further conventional fit indices to avoid admitting a more parsimonious, but inappropriate model.
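The logic behind the fit indices in Table 6.2 can be illustrated with a brief Python sketch. It assumes the textbook ML-based formulas for RMSEA (here with N − 1 in the denominator; some programs use N), CFI, and TLI, and it ignores the robust corrections that an MLR estimation would add. The labels follow the cut-offs for heterogeneous items in Table 6.2; labels for values beyond those cut-offs, all function names, and all input values are assumptions made for illustration.

```python
import math


def rmsea(chi2, df, n):
    """Misfit per degree of freedom, adjusted for sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_t, df_t, chi2_b, df_b):
    """Comparative Fit Index: target model (t) vs. independence baseline model (b)."""
    misfit_t = max(chi2_t - df_t, 0.0)
    misfit_b = max(chi2_b - df_b, misfit_t)
    return 1.0 if misfit_b == 0 else 1.0 - misfit_t / misfit_b


def tli(chi2_t, df_t, chi2_b, df_b):
    """Tucker-Lewis Index based on the chi-square/df ratios of both models."""
    ratio_t = chi2_t / df_t
    ratio_b = chi2_b / df_b
    return (ratio_b - ratio_t) / (ratio_b - 1.0)


def rate_fit(rmsea_value, srmr_value, cfi_value, tli_value):
    """Attach rule-of-thumb labels to the four indices."""
    def incremental(value):
        if value >= 0.95:
            return "good"
        return "acceptable" if value >= 0.90 else "below cut-off"

    if rmsea_value <= 0.05:
        rmsea_label = "close"
    elif rmsea_value <= 0.08:
        rmsea_label = "acceptable"
    elif rmsea_value > 0.10:
        rmsea_label = "insufficient"
    else:
        rmsea_label = "not classified in Table 6.2"

    if srmr_value <= 0.06:
        srmr_label = "good"
    else:
        srmr_label = "acceptable" if srmr_value <= 0.08 else "above cut-off"

    return {"RMSEA": rmsea_label, "SRMR": srmr_label,
            "CFI": incremental(cfi_value), "TLI": incremental(tli_value)}


# Invented chi-square values for a target and a baseline model, N = 600
target = (220.0, 100)
baseline = (3100.0, 120)
labels = rate_fit(rmsea(*target, 600), 0.045, cfi(*target, *baseline), tli(*target, *baseline))
print(labels)
```

Because the cut-offs are rules of thumb, such labels are best read as a first screening of the measurement models and not as a mechanical accept/reject decision.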
For reasons of brevity, only the later modified measurement models will be depicted in the main part of this thesis (see section 7.5). The unmodified measurement models with a priori assignments of indicators to the expectancy, value, and emotion constructs, which will be the starting point of the following model evaluation, can be found in Appendix 7 in the electronic supplementary material. Table 6.3 shows the fit indices of the measurement models in Appendix 7. For a more concise presentation, the fit indices are based on groups of constructs


Table 6.2 Overview of Quality Criteria and Cut-Offs Underlying the Present Study

χ2-test (Geiser, 2010)
Description:
• Tests the null hypothesis that the theoretically implied covariance matrix does not fit significantly worse than the empirical matrix of the population with unrestricted correlations
Cut-off: H0: implied population (co)variance matrix = population (co)variance matrix (unrestricted correlations); p > .05; but with greater samples, theoretically negligible deviations between model and data might increase the sensitivity to misspecifications (Moosbrugger & Kelava, 2020, p. 648)7

χ2 difference test (Geiser, 2010)
Description:
• Used for tests of measurement invariance
• Model comparison of two nested models (i.e., model with additional restrictions compared to the general model)
• Difference between χ2 and degrees of freedom will be tested for statistical significance
Cut-off: H0: restricted model = unrestricted model; p > .05 (more restricted model does not fit significantly worse to the data than the less parsimonious model)

Root-Mean-Square-Error-of-Approximation (RMSEA) (inference-statistical)
Description:
• Root of the squared deviation between model and data per degree of freedom
• Assessment of the misfit in comparison to a saturated ideal model under consideration of the model complexity (df) and sample size
Cut-off: ≤ .05: excellent/close; ≥ .05 and ≤ .08: reasonable/acceptable; > .10: insufficient (Browne & Cudeck, 1992; Hu & Bentler, 1999, p. 27); H0: RMSEA ≤ .05

Standardized-Root-Mean-Square-Residual (SRMR) (absolute/descriptive)
Description:
• Square root of average squared residuals (observed matrices) subtracted by model-implied correlation matrices under consideration of variable scaling
• Low values indicate that the model accounts adequately for the observed correlations between variables
Cut-off: ≤ .06: good; ≤ .08: acceptable (Hu & Bentler, 1999, p. 27)

(continued)

7 Normed χ2 (χ2/df) is not considered as it is also sensitive to sample size only for incorrect models and as there are no unanimously agreed-on cut-offs (Bollen, 1989, p. 278; Kline, 2015, p. 272).


6

Evaluation of the Unmodified Measurement Models

Table 6.2 (continued)

Comparative-Fit-Index (CFI) & Tucker-Lewis-Index (TLI) (incremental)
Description:
• Comparison of the χ2 of a specified target (default) model with that of the more restrictive independence model (uncorrelated variables, each represented by a separate factor) under consideration of model complexity (degrees of freedom) (CFI/TLI) and distributional deviations (CFI)
• Used to compare different models with the same constructs, but different paths
• Indicates the share by which the target model surpasses the fit of the baseline model on grounds of the χ2 value, i.e., the minimum of the discrepancy function
Cut-off: The higher the value, the better the fit of the target model compared to the baseline model.
Homogeneous items (highly correlated): CFI/TLI ≥ .97: good; CFI/TLI ≥ .95: acceptable
Heterogeneous items (moderately correlated): CFI/TLI ≥ .95: good; CFI/TLI ≥ .90: acceptable
In this study, inter-item correlations range from .4–.8. Moreover, some factors have indicator-specific variance. Hence, the cut-offs for the heterogeneous items will be used as reference. (Hu & Bentler, 1999, p. 27; Weiber & Mühlhaus, 2014; Moosbrugger & Kelava, 2020, p. 650)

Information criteria (AIC, BIC; Weiber & Mühlhaus, 2014, p. 218)
Description:
• Aim: high share of explained variance under consideration of model parsimony and sample size; high complexity is penalized
• Allows model comparisons with different causal relations, different numbers of constructs and parameters
• AIC = χ2 in relation to parameters (higher complexity functions as penalty)
• BIC = additionally considers sample size (higher penalty of complexity than AIC)
Cut-off: No cut-offs available; models with the lowest AIC and BIC (under consideration of the other fit criteria) should be chosen because the model is then a simplification of the data and not a mere adaptation of the model to the data.

according to expectancy, value,8 as well as course- and learning-related emotions, combined for all respective measurement occasions. Expectancy and value constructs in particular will first be investigated separately, in contrast to the theoretical EV framework. This is because the original EV models, taken together, have a very bad fit and a non-positive definite covariance matrix. Hence, the separate investigation of expectancy and value constructs makes it easier to pinpoint the need for optimization based on flawless model estimation.

8 Effort is in this case considered to be a value construct based on Ramirez et al. (2012) and Wigfield and Eccles' (2002) conceptualization of effort as a subcategory of subjective task value along with cost. In the structural models, effort will also be considered in the context of expectancy and emotion constructs, since it may also be a representation of actual performance behavior influenced by those anticipated self-appraisals according to the SATS-M model (Ramirez et al., 2012).

Table 6.3 Fit Indices of the Unmodified Measurement Models

Category | Expectancy | Value | Course Emotions | Learning Emotions
Constructs | S, D | V, I, A, E | Jc, Hc | JL, HL
# of occasions | 3 | 3 | 4 | 2
χ2 | 2,839.96c | 9,338.99c | 5,198.41c | 262.52c
df | 480 | 1824 | 1052 | 293
SCF [%] | 13.5 | 1.9 | 15.2 | 23.9
RMSEA | .057c | .052c | .051 | .072c
90% C.I. | .057–.059 | .051–.053 | .049–.052 | .070–.075
SRMR | .083 | .083 | .081 | .102
CFI | .767 | .754 | .856 | .857
TLI | .744 | .737 | .845 | .841
AIC | 163,407.47 | 268,728.93 | 177,341.074 | 150,222.21
BIC | 167,371.82 | 271,601.45 | 182,144.317 | 153,467.63

Notes. df = degrees of freedom, SCF = scaling correction factor

Significant χ2 p-values indicate that the null hypothesis, assuming an equal fit between the implied and empirical covariance matrix, should be rejected. This is the case for all four measurement models. The χ2 value, however, is sensitive to larger sample sizes and higher numbers of manifest variables (Moosbrugger & Kelava, 2020, p. 648). Except for the RMSEA, the fit indices are below the acceptable range for all measurement models. Particularly the expectancy and value measurement models do not fit the data well, as indicated by the CFI and TLI values. While the RMSEA and SRMR for the expectancy and value models are equal to or even better than those of unmodified models from other studies using unparcelled CFA, the CFI and TLI are considerably lower and far below the recommended cut-offs (e.g., Emmioglu et al., 2018; Persson et al., 2019; Shahirah & Moi, 2019; Xu & Schau, 2021). This is a first indication that the EV constructs require a more thorough investigation with regard to the overall model performance. The emotion-related measurement models have comparably better CFI and TLI values but could still be improved.9 The scaling correction factor indicates that the ML χ2 is approximately 10 to 25% higher than the scaled χ2 from the MLR estimation. There is no agreed-on threshold for a problematic percentage, but the values suggest that the dataset is not multivariate normally distributed. The learning-related measurement model has a very high scaling correction factor, which may be due to strong skew and kurtosis of some of the hopelessness and enjoyment indicators (see section 5.4.3). These model results and the further anticipated decrease of model fit due to time invariance restrictions indicate the need for a thorough evaluation and optimization of the measurement models. Low standardized coefficients of unparcelled CFAs serve as a basis to determine the need for optimization on the individual item level. Therefore, the squared standardized coefficients, or rather, indicator reliabilities, will be checked first to ensure that each item precisely measures the intended construct.
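As a rough illustration of how the cut-offs in Table 6.2 relate to the values in Table 6.3, the following Python sketch classifies RMSEA, SRMR, CFI, and TLI. The function names and the dictionary layout are illustrative assumptions and not part of the original analysis, which was run in Mplus. The RMSEA range between .08 and .10 is not labeled in Table 6.2 and is therefore returned as "unclassified".

```python
def classify_rmsea(value):
    """Classify an RMSEA value using the cut-offs listed in Table 6.2."""
    if value <= 0.05:
        return "excellent"
    if value <= 0.08:
        return "acceptable"
    if value > 0.10:
        return "insufficient"
    return "unclassified"  # .08 < RMSEA <= .10 carries no label in Table 6.2


def classify_srmr(value):
    """Classify an SRMR value using the cut-offs listed in Table 6.2."""
    if value <= 0.06:
        return "good"
    if value <= 0.08:
        return "acceptable"
    return "below acceptable"


def classify_incremental(value):
    """Classify CFI or TLI using the cut-offs for heterogeneous items."""
    if value >= 0.95:
        return "good"
    if value >= 0.90:
        return "acceptable"
    return "below acceptable"


# Fit indices copied from Table 6.3 (unmodified measurement models).
FIT = {
    "expectancy": {"rmsea": 0.057, "srmr": 0.083, "cfi": 0.767, "tli": 0.744},
    "value": {"rmsea": 0.052, "srmr": 0.083, "cfi": 0.754, "tli": 0.737},
    "course emotions": {"rmsea": 0.051, "srmr": 0.081, "cfi": 0.856, "tli": 0.845},
    "learning emotions": {"rmsea": 0.072, "srmr": 0.102, "cfi": 0.857, "tli": 0.841},
}


def evaluate(indices):
    """Return a verbal rating for each fit index of one measurement model."""
    return {
        "rmsea": classify_rmsea(indices["rmsea"]),
        "srmr": classify_srmr(indices["srmr"]),
        "cfi": classify_incremental(indices["cfi"]),
        "tli": classify_incremental(indices["tli"]),
    }


if __name__ == "__main__":
    for model, indices in FIT.items():
        print(model, evaluate(indices))
```

Under these assumptions, only the RMSEA is rated acceptable for each of the four models, which matches the verbal summary of Table 6.3.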

6.5 Item-specific Analyses for the Unmodified Measurement Models

6.5.1 Indicator Reliabilities

First-generation reliability criteria (i.e., Cronbach's alpha, item-to-total correlation, inter-item correlations; Weiber & Mühlhaus, 2014, p. 142) will not be considered due to the confirmatory context of the analyses (Bagozzi & Yi, 2012). Moreover, they do not allow for an estimation of the measurement error or for an inference-statistical inspection of the model parameters (Weiber & Mühlhaus, 2014, p. 143). Second-generation reliability criteria are based on a comparison between the variance of an indicator and the variance of the measurement error. The higher the share of explained variance, the better the reliability and the higher the share of common variance between measured and true value (Weiber & Mühlhaus, 2014, p. 146). The indicator reliability results from the squared standardized coefficients (Bagozzi & Baumgartner, 1994; Weiber & Mühlhaus, 2014, p. 150), whereby high values suggest that the manifest variable measures the respective latent construct reliably. The residual is assumed to be caused by measurement errors. Chin (1998b, p. 325) argues that, for scales at an early stage of development, reliabilities of .25–.35 are also acceptable. Since the scales in the present study have already been validated on various occasions (see section 5.3), a higher cut-off has to be applied. Literature generally suggests that values above .40 are acceptable (Bagozzi & Baumgartner, 1994; Weiber & Mühlhaus, 2014, p. 150). However, indicator reliabilities higher than .50 (and thus standardized coefficients > .707) are considered more meritorious because the indicator then shares more variance with the respective latent construct than with the error variance (Chin, 1998a, p. 8). Table 6.4 presents the individual reliabilities for all original items.

9 Model fit for the emotion measurement models will not be balanced against other studies due to a different selection of constructs.

Table 6.4 Indicator Reliabilities of All Original Items

t1

t5

t9

t1

t5

t9

t2

t4

t6

t9

s1

.45

.49

.54

s2

.59

.47

.55

s3

.30

.32

s4

.48

.44

s5

.39

d1 d2

t3

t7

i1

.39

.38

.39

i2

.72

.67

.77

jc 1

.66

.73

.75

.74

jc 2

.42

.46

.46

.35

jL 1

.62

.62

jL 2

.15

.19

.42

i3

.63

.61

.55

i4

.67

.63

.69

jc 3

.44

.41

.31

.66

jc 4

.57

.64

.63

.28

jL 3

.25

.37

.64

jL 4

.41

.46

.52

v1

.32

.35

.39

jc 5

.74

.76

.54

.77

.80

jL 5

.72

.75

.34

.56

.57

v2

.11

.15

.25

v3

.46

.54

.57

jc 6

.70

.64

.67

.68

jc 7

.65

.73

.73

.74

jL 6

.75

.73

.68

.71

.68

jL 7

.69

d3

.52

.42

.49

v4

.47

.52

.58

hc 1

.74

.44

.48

.53

.49

jL 8

.33

.42

d4

.55

.50

d5

.13

.09

.43

v5

.16

.14

.19

.14

v6

.49

.35

.30

hc 2

.33

.52

.63

.55

hL 1

.43

.47

hc 3

.64

.58

.61

.60

hL 2

.44

d6

.12

.17

.23

v7

.17

.17

.52

.18

hc 4

.75

.69

.71

.75

hL 3

.66

.66

a1 a2

.23

.35

.33

v8

.43

.35

.47

.44

e1

.51

.42

.44

hc 5

.73

.71

.72

.78

hL 4

.77

.79

.44

.38

hL 5

.70

a3

.43

.43

.47

e2

.71

.87

.66

.81

a4

.65

.59

.62

e3

a5

.57

.40

.43

.61

.30

.55

a6

.30

.18

.25

Note. All indicator reliabilities are significant at the .01 level.
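The relation between standardized coefficients and the indicator reliabilities in Table 6.4 can be sketched in a few lines of Python. This is a minimal illustration, not the author's Mplus procedure. The function names and the three verbal labels are assumptions introduced here; the bounds of .40 and .50 are the ones named in the text.

```python
import math


def indicator_reliability(std_loading):
    """Indicator reliability as the squared standardized coefficient."""
    return std_loading ** 2


def required_loading(reliability):
    """Standardized coefficient needed to reach a given indicator reliability."""
    return math.sqrt(reliability)


def rate_reliability(reliability, lower=0.40, upper=0.50):
    """Rate a reliability against the lower (.40) and upper (.50) bounds."""
    if reliability > upper:
        return "preferable"
    if reliability > lower:
        return "acceptable"
    return "weak"


if __name__ == "__main__":
    # Reliabilities of item s1 at the three occasions, as listed in Table 6.4.
    for value in (0.45, 0.49, 0.54):
        print(value, rate_reliability(value))
    # A reliability of .50 corresponds to a standardized coefficient of about .707.
    print(round(required_loading(0.50), 3))
```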

Table 6.4 shows that, even taking the lower bound of .40 as a basis for sufficient reliability, some indicators seem problematic. Therefore, each dimension will be investigated separately with regard to its individual values and content.

Expectancy constructs

For self-efficacy, most indicator reliabilities are greater than .40, but also below the upper recommended bound of .50. The indicator s3 has rather weak reliabilities on all occasions but is germane to the construct as regards content ("I can learn statistics."). The difficulty indicators provide a clearer picture, since d2, d5, and d6 show very weak reliabilities with a low explanatory power .70) and stand in contrast to the values of jC 2 and jC 3. The five indicators with lower values also had moderate to high ceiling effects (see section 5.4.3), which might be related to the indicator formulation and has to be investigated later. Hopelessness while learning outside the course has mostly appropriate reliabilities above the lower bound of .40, while hL 1 has values < .50, which are considerably smaller than those of the other four indicators. The same pattern applies to course hopelessness, where only hc 1 and hc 21 have smaller reliabilities than the other three indicators.

The potential reasons for the low indicator reliabilities, in particular considering the EV constructs (i.e., difficulty, value, and affect), must be further investigated. Indicators with a low reliability may be (1) unreliable, (2) influenced by other effects, or (3) capture different aspects of a multidimensional construct, pointing to a formative measurement model (Chin, 1998a). These aspects underline that ensuring a theory-compliant item-factor assignment is an important prerequisite for interpreting the factors appropriately. Cross-loadings may stem from item-construct relations that cannot be accounted for by the theoretically postulated model and could be a major cause of inappropriate goodness-of-fit as well as weak indicator reliabilities (Ozkok et al., 2019; Mai et al., 2018; Marsh et al., 2013). In the confirmatory framework, such cross-loadings cannot be explicitly detected because the definition of measurement models fixes all indicator loadings on the other constructs at zero. In other words, the CFA model already assumes that the respective variables are perfect measures of only the one factor they are assigned to (Schmitt & Sass, 2011, p. 109). An item-level exploratory analysis is also recommended because most studies using the SATS have used parceling, which might veil individual item malfunctioning (see section 5.3.2). Moreover, the AEQ has been adapted to statistics education for this study, so that an item-level inspection helps to preclude that changes from the adaptation compromise measurement performance (Fierro-Suero et al., 2020, p. 8). Therefore, in the next section, the item-level analyses will shift from the confirmatory to the exploratory level to unveil potentially problematic cross-factor loadings.

6.5.2 Dimensionality of the Item Structure

The aim of the following EFA is to ensure that each indicator loads sufficiently highly only on the construct to which it was assigned conceptually. In terms of magnitudes, there is some agreement in the literature that indicator loadings on the intended construct should be greater than .40 or .50, whereas cross-loadings on other factors should not be greater than .40 (Homburg & Giering, 1996, p. 12) to achieve simply structured and saturated factors. Ferguson & Cox (1993, p. 91) also recall that cross-loadings may represent an actual conceptual overlap between the constructs, in which case the deletion should be scrutinized in terms of content. If the scales are, however, required to be psychologically distinctive, the difference between both loadings should be considered. If the difference is small (< .20), a deletion of the item should be considered because its actual factor affiliation is ambiguous (Ferguson & Cox, 1993; Gangire et al., 2020). Hair et al. (2014) argue that items should only be deleted if this improves the composite reliability and AVE.
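The loading rules described above can be sketched as a small Python check. This is only an illustration under stated assumptions: the function name, the flag wording, and the example loadings are hypothetical and are not taken from the study. The thresholds (.40 for the target loading, .40 for cross-loadings, .20 for the difference) are the ones named in the text.

```python
def evaluate_item(loadings, target, min_primary=0.40, max_cross=0.40, min_gap=0.20):
    """Flag an item based on its rotated loadings.

    loadings: dict mapping factor name to rotated loading (hypothetical input).
    target:   name of the factor the item was assigned to conceptually.
    """
    primary = abs(loadings[target])
    cross = [abs(value) for factor, value in loadings.items() if factor != target]
    largest_cross = max(cross, default=0.0)

    flags = []
    if primary <= min_primary:
        flags.append("weak target loading")
    if largest_cross > max_cross:
        flags.append("high cross-loading")
    if primary - largest_cross < min_gap:
        flags.append("ambiguous factor affiliation")
    return flags


if __name__ == "__main__":
    # Hypothetical items, for illustration only.
    print(evaluate_item({"S": 0.74, "D": 0.10}, target="S"))
    print(evaluate_item({"S": 0.55, "D": 0.42}, target="S"))
```

An item that returns an empty list meets all three rules. Any flag would, following the text, still call for a content-based judgment before deletion.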


The factor structures will be investigated following the recommendation of Weiber and Mühlhaus (2014, p. 132), whereby, for reasons of brevity, several indicator sets are investigated simultaneously according to the expectancy, value, learning emotion, and course emotion constructs. Each set of constructs will be tested for one- to multi-factor solutions to check whether the indicators represent fewer or more constructs than theoretically assumed. The MLR estimator is selected along with the default oblique GEOMIN rotation under the assumption that factor indicators load on more than one factor. GEOMIN rotation maximizes cross-loadings for certain items while minimizing others and was found to be very close to Thurstone's simple structure pattern (Schmitt & Sass, 2011, p. 107). The choice of an oblique rotation assumes higher inter-factor correlations, which applies to the correlations between the constructs assessed in this study. Moreover, GEOMIN is recommended for well-developed measures such as the AEQ and SATS where fewer and smaller cross-loadings are expected (Schmitt & Sass, 2011, p. 107; see section 5.3).

Expectancy constructs

The expectancy constructs entail the factors self-efficacy and difficulty. Table 6.5 shows the EFA results from Mplus.

Table 6.5 Model Fit of the Original Expectancy Constructs

χ2

df

RMSEA

RMSEA C.I.

CFI

TLI

SRMR

3.267

1124.89c

44

.138c

.131–.145

.574

.467

.114

t5

3.576

1104.85c

.142c

.135–.149

.578

.473

.108

t9

3.828

915.70c

.168c

.158–.177

.564

.455

.130 .040

Eigen-value 1-factor-solution t1

2-factor-solution [theoretically assumed] 1.982

294.31c

.077c

.069–.086

.897

.834

t5

1.903

282.34c

.078c

.070–.087

.901

.840

t9

2.147

212.14c

.086c

.075–.096

.911

.856

.064c

.055–.074

.948

.886

.028

t1

34

3-factor-solution 1.094

155.96c

t5

1.076

96.33c

.049

.039–.059

.971

.937

.023

t9

1.018

174.18c

.092c

.079–.105

.925

.836

.033

t1

25

Note. Eigenvalue refers to the eigenvalue of the sample correlation matrix for the indicated number of extracted factors.
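The Kaiser criterion (extract factors as long as the eigenvalue exceeds 1) can be applied to the eigenvalues in Table 6.5 with a short Python sketch. The function name and the data layout are assumptions for illustration, and the eigenvalues are entered as read from the table for the 1-, 2-, and 3-factor solutions. Because the table reports only these three eigenvalues per occasion, the count is a lower bound rather than a definitive number of factors.

```python
def kaiser_retained(eigenvalues, threshold=1.0):
    """Count the leading eigenvalues that exceed the threshold."""
    count = 0
    for eigenvalue in eigenvalues:
        if eigenvalue > threshold:
            count += 1
        else:
            break  # stop at the first eigenvalue at or below the threshold
    return count


# Eigenvalues for the 1-, 2-, and 3-factor solutions, as read from Table 6.5.
EIGENVALUES = {
    "t1": [3.267, 1.982, 1.094],
    "t5": [3.576, 1.903, 1.076],
    "t9": [3.828, 2.147, 1.018],
}

if __name__ == "__main__":
    for occasion, values in EIGENVALUES.items():
        print(occasion, kaiser_retained(values))
```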


The parameter estimates and the Kaiser criterion (i.e., factor extraction as long as the eigenvalues are greater than 1) suggest that only a 3-factor solution begins to yield adequate fit indices with the original set of indicators. The rotated factor matrix of the 1-factor solution underlines that self-efficacy and difficulty are different constructs because the indicators d2 to d6 only have a weak loading on the general factor. Despite the better fit, the 3-factor solution extracts an additional factor for s3 and s4, which is likely due to their positive formulation, while s1, s2, and s5 are negatively worded. A factor with only two items, however, is considered problematic in terms of model identification. Moreover, the 3-factor solution does not solve the problem of the weak loadings of d2, d5, and d6 and still has inadequate RMSEA and TLI values (see Appendix 5 in the electronic supplementary material). These findings suggest that choosing a factor solution other than the theoretically assumed one does not compensate for the problematic fit and brings along other issues. Therefore, only the rotated factor matrix of the 2-factor solution will be considered in Table 6.6 to identify cross-loadings.

Table 6.6 GEOMIN Rotated Factor Loadings for the Original Expectancy Constructs

S1 s1

.647

s2

.759

s3

.615

s4

.735

s5

.596

d1

−.287

D1

S5 .572 .211

.483

D5

S9

D9

−.238

.597

−.274

.611

.704

.728

.799

.745

.796

.501

−.422

−.255

.673

.722

.541

−.421

d2

.318

.364

.543

d3

.735

.720

.720

d4

.748

.703

.615

d5

.369

.304

.379

d6

.294

.390

.426

Note. All loadings between −.2 and .2 are left blank.

While the standardized coefficients were adequate for all indicators except for s3, the rotated loadings show that s1 and s5 have cross-loadings on the difficulty factors that increase over time, combined with a rather weak loading on the intended construct. While the cross-loading for s1 is still tolerable, s5 at t5 and t9 exceeds the above-mentioned .20 threshold between cross-loadings. From a content perspective, the formulation of these indicators places a strong emphasis on the appraisal of difficulty.10 The indicator s3 had the smallest standardized coefficients but is adequately distinct from difficulty (cross-loadings < .20 between factors and > .40 on the target construct). For the difficulty construct, the indicators d2, d5, and d6 have weak factor loadings, which are mostly below the recommended threshold of .40. These findings coincide with other studies that found very low factor loadings for some of the difficulty indicators (Bechrakis et al., 2011; Carnell, 2008; Tempelaar, van der Loeff, et al., 2007; Vanhoof et al., 2011). The items might not fit sufficiently well into the difficulty construct as they also involve appraisals of general, third-person preconceptions about statistics as an allegedly technical or unfamiliar subject (point of view of "most people"). This indicates that a differentiation between the more concrete individual sense of difficulty, reflected in the other three indicators, and commonly held appraisals of difficulty might be necessary. The three indicators also had very small indicator reliabilities and should be considered for removal.

10 Realigning these indicators to the difficulty construct deteriorated their indicator reliabilities and overall model fit considerably.

Value constructs

The same procedure is applied to the value constructs, comprising the constructs interest, value, affect, and effort. The 1- and 2-factor solutions will not be considered due to very low fit indices. The 5-factor solution did not converge at t5 and t9. Hence, only the 3- and 4-factor solutions will be juxtaposed in Table 6.7. The GEOMIN rotated factor matrix for the 3-factor solution determines one factor for interest and utility value appraisals (interest & value); the second and third factors can be ascribed to effort and affect (see Appendix 5 in the electronic supplementary material).
While the value-interest factor yields a simple structure with occasionally weak loadings (< .60), the theoretical distinction between interest and value is too relevant within the EV framework (see section 3.2.7), so that the factors will not be merged. Moreover, the 3-factor solution has a considerably worse fit compared to the 4-factor model. The goodness-of-fit of the theoretically assumed 4-factor solution is still below the recommended cut-off values. Since none of the proposed 1- to 5-factor solutions yields adequate fit and some had issues in the estimation process, optimization is necessary with a focus on the theoretically assumed 4-factor structure. The 4-factor solution is depicted in Table 6.8.


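Table 6.7 Model Fit of the Original Value Constructs

Solution | Occasion | Eigenvalue | χ2 | df | RMSEA | RMSEA C.I. | CFI | TLI | SRMR
3-factor-solution | t1 | 1.888 | 1841.44c | 150 | .094c | .090–.098 | .810 | .734 | .051
3-factor-solution | t5 | 1.877 | 2052.56c | 150 | .102c | .098–.106 | .772 | .681 | .059
3-factor-solution | t9 | 1.927 | 1136.44c | 150 | .097c | .091–.102 | .814 | .739 | .051
4-factor-solution [theoretically assumed] | t1 | 1.356 | 924.58c | 132 | .068 | .064–.073 | .911 | .858 | .033
4-factor-solution [theoretically assumed] | t5 | 1.353 | 154.56c | 132 | .094 | .090–.098 | .831 | .731 | .044
4-factor-solution [theoretically assumed] | t9 | 1.273 | 773.80c | 132 | .083 | .077–.089 | .879 | .807 | .039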

Table 6.8 GEOMIN Rotated Factor Loadings for the Original Value Constructs I1

V1

E1

A1

I5

V5

A5

E5

I9

i1

.703

.713

i2

.914

.727

i3

.704

.537

.337

.671

i4

.697

.640

.312

.719

V9

.396

.424

.497

v2

.631

.680

.677

v3

.856

.821

.813

v5

.774 .280

.760

.186

.495

.713

v7

.242

.343

.160

v8

.499

.274

.382

.272

.796 −.210

.020

v6

.348

.632

.188 .592

.328 .205

.130 .556

e1

.730

.488

.589

e2

.922

.576

.934

e3

.765

.684

a1

.390

.326

.633

A9

.909

v1

v4

E9

.675

.239

.718 .584

.222 (continued)


Table 6.8 (continued) A1

I5

a2

I1

.547

.325

a3 a4 a5 a6

.218

V1

E1

V5

A5

E5

I9

V9

E9

A9

.553

.592

.664

.674

.723

.825

.821

.771

.716

.209

.238

.208

.470

.270

.816 .695 .229

.273

Note. All loadings between −.2 and .2 are left blank

As can be seen in Table 6.8, the interest and effort constructs are not affected by cross-loadings, except for i3 and i4, which have a loading of approx. .30 on the affect construct at t5. Most obviously, v5 and v7 have a loading < .40 on all constructs.11 These two indicators have a low factor loading on their own intended construct, but a higher loading on the interest construct.12 An indicator with higher loadings on other constructs than the designated one is an indication of lacking discriminant validity (Chin, 1998b, p. 321). The items v1 and v8 are less problematic than v5 and v7, but also have consistently weak loadings and one loading < .40 each at t1 and t5, respectively. Moreover, both indicators have cross-loadings: for v1, the cross-loading difference is slightly above the threshold of .20 at t1 and t9, while v8 has a cross-loading that clearly surpasses the threshold at t5.13 Concerning the affect construct, a1 loads more strongly on the interest construct ("I like statistics") due to its intrinsic notion. The item a6 has a similar intrinsic notion ("I like mathematics"), which translates into cross-loadings with the interest and effort constructs below the difference threshold of .20. The item a2 at t5 has a cross-loading on the interest construct with a difference only slightly greater than .20 compared to the intended factor loading but seems sufficiently distinctive at the other measurement occasions. Hence, a1 and a6 should be considered as first-priority candidates for removal.

11 The item v7 is an item that was added to the original scale for test purposes, which may also explain that it is less related to the original construct.

12 "I use statistics in my everyday life" (v5) and "Statistics is helpful to better understand other modules" (v7) might indeed involve aspects of intrinsic motivation in terms of enjoying using statistics in everyday life and study situations.

13 Due to the heterogeneous factor loadings of value, an EFA was conducted to check for a potential multidimensionality (similar to Xu & Schau, 2021). However, solutions with more factors for value were deemed inappropriate because each newly extracted factor only consisted of one indicator with a high loading and reduced neither the weak loadings nor the cross-loadings of the one-dimensional solution.

Learning-related emotion constructs

According to the previously analyzed factor and indicator reliabilities, AVE, and Fornell-Larcker criterion, the emotion constructs already seem to perform adequately. Global model fit (CFI, TLI, and RMSEA in particular), however, still reveals potential for optimization, so that the indicators also undergo a factor analysis. First, Table 6.9 juxtaposes the model fit of the 1- to 3-factor solutions, whereby enjoyment and hopelessness while studying for the statistics course are supposed to be two-dimensional.

Table 6.9 Model Fit of the Original Learning-Related Emotion Constructs

Eigen-value

χ2

df

RMSEA

RMSEA C.I.

CFI

TLI

SRMR

2778.28

65

.176c

.171−.182

.597

.516

.144

.192c

.185−.198

.590

.508

.149

.091c

.085−.097

.913

.872

.057

.106c

.099−.113

.897

.849

.059

.035

.027−.043

.990

.981

.014

.041

.032−.050

.955

.977

.013

1-factor-solution t3 t7

5.644 6.233

2661.94

2-factor-solution [theoretically assumed] t3 t7

2.216 2.244

64.62

53

702.10

3-factor-solution t3

1.289

112.26

t7

1.258

72.50

42

With the original constellation of items, the model with the general factor has an unsatisfactory fit. The 3-factor solution yields acceptable fit, but the third extracted factor only includes three moderately loaded items which had weak factor loadings in the 2-factor solution (see Appendix 5 in the electronic supplementary material). However, the rotated 3-factor loadings yield even higher cross-loadings for jL 4 and jL 8, and hL 1 compared to the 2-factor solution, so that the three factors are neither distinguishable nor interpretable. The GEOMIN rotated loadings of the two-factor solution in Table 6.10 are therefore considered as a starting point to optimize the theoretically underlying factor structure. For learning-related enjoyment, five indicators have high loadings on one factor and weak loadings .6 representing the commonly accepted, satisfactory cutoff (Bagozzi & Yi, 2012; Weiber & Mühlhaus, 2014, p. 150). The omission of indicators with lower loadings is advised when it increases or does not change composite reliability (Cortina, 1993)14 . Mplus has no options to automatically output composite reliability and average variance extracted, so that the following formulae were programmed manually into model constraints. When factor loadings are standardized and factor variance is set to 1, which has to be done explicitly in Mplus, the factor reliability is computed as follows (Weiber & Mühlhaus, 2014, p. 151): 

$$CR(\xi_j) = \frac{\left(\sum_i \lambda_{ij}\right)^2}{\left(\sum_i \lambda_{ij}\right)^2 + \sum_i \theta_{ii}}$$

with:
λ_ij = estimated factor loading of indicator i on factor j
θ_ii = estimated variance of the error variables

The composite reliability is thus computed as the squared sum of the loadings of the indicators i on factor j, divided by the squared sum of the factor loadings plus the sum of the error variances. The factor determinacy indicates the quality of the factor estimation, ranging from 0 to 1, with 1 representing a perfect correlation between the estimated factor score and the true factor score. Conversely, indeterminacy would imply that different sets of factor scores vary in such a way that individuals may be ranked arbitrarily high or low within their respective constructs. High factor determinacy depends on the height of the communalities, i.e., the squared correlations of the individual indicators with each factor (Brown, 2006). The results for the dimensions are shown in Table 6.13.

14 Cronbach's alpha values will be omitted in this confirmatory context since coefficient alpha puts equal weight on all items loading on the construct and increases with the number of indicators assigned to the latent construct, while composite reliability is a function of the actual factor loadings (Xu & Schau, 2019).

Table 6.13 Composite Reliabilities and Factor Determinacies of the Original Constructs

 | S1 | S5 | S9 | D1 | D5 | D9 | I1 | I5 | I9 | V1 | V5 | V9
CR | .796 | .795 | .841 | .691 | .712 | .674 | .857 | .841 | .869 | .832 | .832 | .844
FD | .909 | .918 | .931 | .884 | .902 | .908 | .944 | .946 | .956 | .939 | .951 | .954

 | A1 | A5 | A9 | E1 | E5 | E9 | J3L | J7L | H3L | H7L
CR | .809 | .798 | .805 | .852 | .719 | .801 | .878 | .902 | .882 | .895
FD | .927 | .935 | .935 | .952 | .910 | .937 | .961 | .965 | .952 | .956

 | J2C | J4C | J6C | J9C | H2C | H4C | H6C | H9C
CR | .912 | .921 | .918 | .910 | .870 | .881 | .900 | .896
FD | .964 | .970 | .972 | .971 | .950 | .949 | .956 | .957

Note. CR = composite reliability, FD = factor determinacy
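The composite reliability formula can be sketched in Python as follows. This is a minimal illustration and not the Mplus model constraint used in the study. The function name and the example loadings are hypothetical, and the default error variance of 1 − λ² is an assumption that holds only for fully standardized loadings with the factor variance set to 1.

```python
def composite_reliability(loadings, error_variances=None):
    """Composite reliability: (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    if error_variances is None:
        # Assumption: standardized solution, so each error variance is 1 - loading^2.
        error_variances = [1.0 - loading ** 2 for loading in loadings]
    squared_sum = sum(loadings) ** 2
    return squared_sum / (squared_sum + sum(error_variances))


if __name__ == "__main__":
    # Hypothetical standardized loadings of three indicators on one factor.
    example_loadings = [0.7, 0.8, 0.6]
    print(round(composite_reliability(example_loadings), 3))
```

With these hypothetical loadings, the squared sum is 4.41 and the error variances add up to 1.51, so the composite reliability is 4.41 / 5.92, which lies above the .6 threshold mentioned in the text.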

Composite reliabilities and factor determinacies are adequate for most constructs. The emotion constructs have the highest composite reliabilities and factor determinacies (approx. .90 and .96, respectively). Self-efficacy, value, interest, effort, and affect also have appropriate values, while the difficulty construct has the lowest composite reliability, which is still greater than the threshold of .6. The lower reliability of difficulty conforms to the findings of other studies, in which it was also mostly one tenth less than that of the other constructs (Schau, 1995; Schau & Emmioglu, 2012; Stanisavljevic et al., 2014). This aligns with the low indicator reliabilities for the difficulty construct. The difficulty indicators also have very low inter-correlations, partly even close to zero, which partly contradicts the assumed reflective nature of the construct.15 In all, despite the occasionally low indicator reliabilities, the composite score suggests an adequate reliability of all construct measurements.

15 Implications for further analysis of this finding will be considered in the final evaluation of the measurement models (section 7.5).

Apart from that, the validity of the measurements, for which reliability is a necessary requirement, needs to be factored into the considerations for optimization. Construct validity is given if construct measurement is not distorted by other constructs or systematic errors and consists of nomological, convergent

6.6 Construct-level Analyses for the Original Measurement Models


and discriminant validity (Weiber & Mühlhaus, 2014, p. 160). The overall goodness of fit already gives a first indication of construct validity in that it assesses the match between the data and the hypothesized measurement models (Hair et al., 2014). The analyses in section 6.4 have shown that the global fit of the models based on the original number of items needs optimization. Moreover, convergent validity can be assumed to be given if two maximally different methods yield corresponding measurement results, while discriminant validity is given when measurement results of different constructs are significantly different from each other (Weiber & Mühlhaus, 2014, p. 162). In research practice, the use of maximally different methods for the evaluation of reliability and validity is often not feasible (Weiber & Mühlhaus, 2014, p. 164). Hence, instead of using maximally different methods, the concept of multiple items measuring the same construct (even though with the same method) is often used as an approximation from which the absence of convergent and discriminant validity can be inferred by means of the average variance extracted and the Fornell-Larcker criterion (Fornell & Larcker, 1981; Hair et al., 2011; Weiber & Mühlhaus, 2014, p. 164). These two criteria also give information on whether the constructs are sufficiently distinct from each other.

6.6.2

Average Variance Extracted

The average variance extracted refers to the percentage of the variance of the latent construct that is explained on average across the indicators of the respective construct (Fornell & Larcker, 1981). The AVE is computed as follows, when factor loadings are again standardized, and factor variance is set to 1 (Weiber & Mühlhaus, 2014, p. 151):

\[
\mathrm{AVE}(\xi_j) = \frac{\sum_i \lambda_{ij}^2}{\sum_i \lambda_{ij}^2 + \sum_i \theta_{ii}}
\]

with:
$\lambda_{ij}$ = estimated factor loading i on factor j
$\theta_{ii}$ = estimated variance of the error variables

If the AVE is greater than .5, there seems to be no indication for an absence of convergent validity because then, the shared variance of the indicator variables is greater than the measurement error (Bagozzi & Yi, 1988). Table 6.14 shows the AVE of all constructs in their original constellation.

Table 6.14 AVE of the Original Factors

AVE per factor

S1    S5    S9    D1    D5    D9    I1    I5    I9    V1    V5    V9
.441  .438  .515  .294  .314  .355  .603  .573  .627  .393  .396  .416

A1    A5    A9    E1    E5    E9    J3L   J7L   H3L   H7L
.421  .403  .422  .660  .467  .580  .488  .544  .603  .633

J2C   J4C   J6C   J9C   H2C   H4C   H6C   H9C
.599  .628  .620  .598  .578  .587  .616  .608

Even though the values mostly increase over time, the EV constructs seem problematic, as most of them have an AVE below .5. In contrast to the acceptable composite reliabilities, these values suggest that there is potential for optimization: the AVE is low for those constructs that have a few indicators with low reliabilities (self-efficacy, difficulty, value, affect), suggesting that they lack convergent validity (Weiber & Mühlhaus, 2014, p. 164). The AVE of all emotion constructs is acceptable (> .50), except for J3L. The Fornell-Larcker criterion will be investigated next to check whether a lack of distinctiveness between the constructs might be another issue of the measurement.
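As a minimal sketch of the AVE definition in this section (not the software used in the study), the following Python snippet computes the AVE of a single factor from standardized loadings. The function name and the loadings are hypothetical illustration values, not estimates from this study. If no error variances are passed, the sketch assumes a fully standardized solution with θ_ii = 1 − λ_ij², in which case the AVE reduces to the mean squared loading.

```python
import math

def average_variance_extracted(loadings, error_variances=None):
    """AVE = sum(lambda^2) / (sum(lambda^2) + sum(theta)) for one factor.

    Assumes standardized loadings and a factor variance of 1. If no error
    variances are given, theta_i = 1 - lambda_i^2 is assumed.
    """
    squared = [loading ** 2 for loading in loadings]
    if error_variances is None:
        error_variances = [1.0 - s for s in squared]
    explained = sum(squared)
    return explained / (explained + sum(error_variances))

# Hypothetical standardized loadings of a four-indicator construct.
example_loadings = [0.80, 0.75, 0.70, 0.55]
ave = average_variance_extracted(example_loadings)
root_ave = math.sqrt(ave)  # the value later compared with factor correlations
meets_cutoff = ave > 0.5
print(f"AVE = {ave:.3f}, root AVE = {root_ave:.3f}, above .5: {meets_cutoff}")
```

In this hypothetical case a single weak indicator (.55) is enough to pull the AVE just below the cutoff of .5, which illustrates why constructs with a few low indicator reliabilities show a low AVE despite an acceptable composite reliability.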

6.6.3

Fornell-Larcker Criterion

The Fornell-Larcker criterion is used to assess the discriminant validity of the constructs, i.e., the extent to which the indicator variables can be discriminated from those of another construct¹⁶. Therefore, the shared variance between the construct and its indicators should be greater than the variance that the construct shares with other factors and their respective variable bundles. The criterion is fulfilled if the root AVE of the latent construct is greater than the correlations between the construct in question and the other relevant constructs (Fornell & Larcker, 1981).

\[
\sqrt{\mathrm{AVE}(\xi_j)} > r(\xi_i, \xi_j) \qquad (6.1)
\]

16

Weiber and Mühlhaus (2014) also suggest using a χ² difference test with an unrestricted model (free estimation of factor correlations) and a restricted model (factor correlation of two constructs fixed at one) to test whether both constructs measure the same concept. These tests were insignificant in all constellations for this study, so that the specific results will not be depicted. While this is a first sign of discriminant validity, the focus of the chapter is on the Fornell-Larcker criterion, which is a more informative criterion regarding the degree of discriminant validity.
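The χ² difference test that Weiber and Mühlhaus (2014) suggest as a complementary check compares a restricted model (factor correlation fixed at one) with an unrestricted model, which differ by one degree of freedom. A minimal sketch of the naive version of this test is given below. The function name and all fit statistics are hypothetical and not taken from the study, and the sketch ignores the scaling corrections that robust estimators require for difference testing.

```python
import math

def chi_square_difference_test(chi2_restricted, df_restricted,
                               chi2_unrestricted, df_unrestricted):
    """Naive chi-square difference test for two nested models (1 df only)."""
    delta_chi2 = chi2_restricted - chi2_unrestricted
    delta_df = df_restricted - df_unrestricted
    if delta_df != 1:
        raise ValueError("this sketch only covers a difference of 1 df")
    if delta_chi2 < 0:
        raise ValueError("restricted model cannot fit better than the unrestricted one")
    # Survival function of a chi-square variable with 1 df:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(delta_chi2 / 2.0))
    return delta_chi2, p_value

# Hypothetical fit statistics (assumption for illustration only).
delta, p = chi_square_difference_test(
    chi2_restricted=118.4, df_restricted=35,
    chi2_unrestricted=102.1, df_unrestricted=34,
)
print(f"delta chi2 = {delta:.2f}, p = {p:.5f}")
```

A small p-value in such a comparison would indicate that fixing the factor correlation at one worsens the fit, i.e., that the two constructs are statistically distinguishable.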

Table 6.15 shows the factor correlations and the square root of the AVE on the diagonal to allow for a direct comparison. The root AVE of each construct will be compared to the correlation of the constructs at the same measurement occasion (e.g., root AVE of S1 with the factor correlation between S1 and D1 at t1: .664 > −.301).

Table 6.15 Root AVE and Factor Correlations of the Original Expectancy Constructs

       S1     S5     S9     D1     D5     D9
S1   .664   .537   .536  −.301  −.280  −.200
S5   .537   .662   .756  −.180  −.484  −.408
S9   .536   .756   .718  −.146  −.344  −.418
D1  −.301  −.180  −.146   .542   .483   .455
D5  −.280  −.484  −.344   .483   .560   .676
D9  −.200  −.408  −.418   .455   .676   .596

Based on the Fornell-Larcker criterion, there is no indication of an absence of discriminant validity for the self-efficacy and difficulty constructs, with one borderline case: the root AVE of D5 (.560) is only slightly greater than the absolute value of the intercorrelation between S5 and D5 (−.484). Inter-construct correlations between self-efficacy and difficulty at the same measurement occasion are moderate (between .30 and .40). Correlations between the same factors over time become most pronounced between t5 and t9, which is a first indication of a growing stabilization of the expectancy appraisals by the end of the semester.

Table 6.16 shows the root AVE and factor correlations for the value components. The Fornell-Larcker criterion seems to be a major issue for the value and interest constructs. The root AVEs of I5 and I9 are only slightly higher than the correlations between I5 and V5 as well as between I9 and V9 (.757 > .749 and .792 > .771). The root AVE of the value constructs is even smaller than the value-interest correlation at the same measurement occasion on all three occasions (.627 < .677, .629 < .749, and .645 < .771). This suggests an absence of discriminant validity, which has already been foreshadowed by the cross-loadings between both constructs (see section 6.5.2). The other inter-construct correlations are moderate and in accordance with the Fornell-Larcker criterion. Correlations between affect on the one hand, and value and interest on the other hand, increase over time but remain moderate and below the root AVE. Correlations of effort with value and interest decrease over time, suggesting that initial interest and value appraisals at the beginning of the semester may relate to students’ willingness to achieve, but could shift to other cognitive processes while studying during the semester. Again, autocorrelations between constructs increase over time. Most pronounced is the correlation between A5 and A9 (.826), while all other autocorrelations between the second and third measurement occasion range between .70 and .80.

Table 6.16 Root AVE and Factor Correlations of the Original Value Constructs

       I1     I5     I9     V1     V5     V9     A1     A5     A9     E1     E5     E9
I1   .777   .581   .522   .677   .456   .410   .337   .339   .284   .301   .055   .093
I5   .581   .757   .715   .468   .749   .549   .186   .482   .414   .241   .321   .224
I9   .522   .715   .792   .436   .561   .771   .143   .459   .526   .212   .154   .219
V1   .677   .468   .436   .627   .621   .561   .288   .283   .236   .235   .018   .052
V5   .456   .749   .561   .621   .629   .762   .125   .435   .321   .166   .142   .097
V9   .410   .549   .771   .561   .762   .645   .053   .333   .417   .153   .067   .165
A1   .337   .186   .143   .288   .125   .053   .649   .649   .560  −.105  −.139  −.142
A5   .339   .482   .459   .283   .435   .333   .649   .635   .826  −.002   .022  −.106
A9   .284   .414   .526   .236   .321   .417   .560   .826   .650  −.108   .035  −.052
E1   .301   .241   .212   .235   .166   .153  −.105  −.002  −.108   .812   .500   .415
E5   .055   .321   .154   .018   .142   .067  −.139   .022   .035   .500   .683   .765
E9   .093   .224   .219   .052   .097   .165  −.142  −.106  −.052   .415   .765   .762

Finally, Table 6.17 depicts the root AVE for the course- and learning-related emotion constructs. For the emotion constructs, there is no indication of lacking discriminant validity, as the diagonal root AVE values are consistently higher than the inter-construct correlations at the same measurement occasion. Correlations between enjoyment and hopelessness, both in the learning and the course context, are moderate. Autocorrelations between neighboring measurements of the emotion constructs again increase over time. Given the indications of a lack of convergent and discriminant validity for some EV constructs, it seems advisable to shift back from the construct level to the indicator level to scrutinize potential items for removal, while prioritizing indicators with consistently low reliabilities and cross-loadings from constructs with an AVE in need of improvement.

Table 6.17 Root AVE and Factor Correlations of the Course and Learning Emotion Constructs

        J2C    J4C    J6C    J9C    H2C    H4C    H6C    H9C
J2C   .773   .679   .616   .525  −.422  −.250  −.284  −.267
J4C   .679   .792   .753   .722  −.254  −.415  −.375  −.301
J6C   .616   .753   .788   .823  −.214  −.351  −.491  −.327
J9C   .525   .722   .823   .777  −.191  −.299  −.406  −.375
H2C  −.422  −.254  −.214  −.191   .758   .486   .435   .395
H4C  −.250  −.415  −.351  −.299   .486   .764   .643   .560
H6C  −.284  −.375  −.491  −.406   .435   .643   .783   .597
H9C  −.267  −.301  −.327  −.375   .395   .560   .597   .778

        J3L    J7L    H3L    H7L
J3L   .706   .686  −.416  −.395
J7L   .686   .737  −.316  −.446
H3L  −.416  −.316   .665   .543
H7L  −.395  −.446   .543   .675
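The Fornell-Larcker comparison applied in Section 6.6.3 can be sketched in a few lines of Python. The values are transcribed from Table 6.15 and, for the I5-V5 pair, from Table 6.16. The function name, the rule of comparing the absolute correlation with the smaller of the two root AVEs, and the convention that a label ends in its occasion digit are assumptions of this sketch rather than the procedure documented for the study.

```python
def fornell_larcker_violations(labels, matrix, same_occasion_only=True):
    """List construct pairs whose |correlation| is not below both root AVEs.

    `matrix` holds the root AVE on the diagonal and factor correlations
    elsewhere. Labels are assumed to end in the occasion digit (e.g., "S5").
    """
    violations = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if same_occasion_only and labels[i][-1] != labels[j][-1]:
                continue
            smaller_root_ave = min(matrix[i][i], matrix[j][j])
            if abs(matrix[i][j]) >= smaller_root_ave:
                violations.append((labels[i], labels[j], matrix[i][j]))
    return violations

# Root AVE (diagonal) and factor correlations from Table 6.15.
expectancy_labels = ["S1", "S5", "S9", "D1", "D5", "D9"]
expectancy = [
    [0.664, 0.537, 0.536, -0.301, -0.280, -0.200],
    [0.537, 0.662, 0.756, -0.180, -0.484, -0.408],
    [0.536, 0.756, 0.718, -0.146, -0.344, -0.418],
    [-0.301, -0.180, -0.146, 0.542, 0.483, 0.455],
    [-0.280, -0.484, -0.344, 0.483, 0.560, 0.676],
    [-0.200, -0.408, -0.418, 0.455, 0.676, 0.596],
]
same_occasion = fornell_larcker_violations(expectancy_labels, expectancy)
all_pairs = fornell_larcker_violations(
    expectancy_labels, expectancy, same_occasion_only=False
)

# Interest and value at t5, taken from Table 6.16.
value_interest = fornell_larcker_violations(
    ["I5", "V5"], [[0.757, 0.749], [0.749, 0.629]]
)
print(same_occasion, all_pairs, value_interest)
```

Under these assumptions, no same-occasion pair of the expectancy constructs is flagged, whereas the I5-V5 pair is, which mirrors the verbal conclusions of Section 6.6.3. When all pairs are compared, the autocorrelations S5-S9 and D5-D9 also reach the root AVE, but the chapter only compares constructs at the same measurement occasion.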

7

Optimization of the Measurement Models

7.1

Expectancy-value Indicators Considered for Removal or Optimization

In this chapter, as a basis for modeling the structural relations of the postulated FRAME model, the separate parts of the measurement models (expectancy, value, course, and learning emotions) will be evaluated and optimized whenever deemed necessary. For all subsequent comparisons in Section 7.2, the following measurement model names from Table 7.1 apply. An overview of all model identifiers used throughout the study can be found in Appendix 4 in the electronic supplementary material.

Table 7.1 Identifiers for the Measurement Model Evaluation

M0   Unmodified measurement models (for all constructs)
M1   Modified without problematic indicators (for all constructs)
M2   Modified models M1 with additional method factors (for some expectancy-value constructs only)

Decisions on potential deletion will be based on the impact that indicator deletions might have on factor-level composite reliability, AVE, and the Fornell-Larcker criterion by repeating the analyses for the reduced variable sets. Deletion of certain inappropriately functioning indicators might be necessary and will be preferred to changing the factor structure, since the latter did not provide a remedy for weak loadings and cross-loadings (see Section 6.5.2). This coincides with the findings of Vanhoof et al. (2011) as well as Xu and Schau (2020) in the scope of unparcelled CFA model optimization, where deleting a few items had a greater effect on model fit than an alteration of the factor structure or item-factor affiliation. Moreover, the later structural models will be burdened with further invariance restrictions for the longitudinal analyses, which impact model fit. Items will, however, only be deleted if this is justifiable and consistent in terms of content. Items that are essential regarding content but do not perform optimally could also be modelled as method factors (see Section 7.1.3). Changes in the goodness-of-fit will be considered at the end of each modification¹.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-41620-1_7.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_7

7.1.1

Expectancy Constructs

Each subchapter starts with a table summarizing all indicators in need of improvement² based on their indicator reliability or cross-loadings as a basis for further optimization, starting with Table 7.2.

Table 7.2 Problematic Indicator Reliabilities and Cross-Loadings of the Expectancy Models

      CL    t1    t5    t9
s1    0    .45   .49   .54
s3    +    .30   .32   .42
s5    -    .39   .46   .52
d2    -    .11   .15   .25
d5    -    .13   .09   .14
d6    -    .12   .17   .24

Notes. CL = cross loading; + = indicator has a simple structure at all occasions; 0 = indicator only has minor cross-loading, but a weak factor loading at least at one occasion; - = cross-loading with a difference less than .20, or factor loading < .40

1

For each type of constructs, the items with the lowest reliability were deleted one at a time and re-evaluated within another CFA. To avoid redundancies and to ensure an appropriate flow of reading, descriptions of the incremental results will be omitted if they do not result in a deterioration of the model quality.

2

Bagozzi & Yi (2012) state that occasional low indicator reliabilities within a construct might not affect overall reliability if they are compensated by other, better functioning items. This applies to d11, a12, v56, v96, e91, e53. The indicators are therefore not considered to be problematic. More importantly, the variables in their entirety should adequately reflect the theoretical construct (Bagozzi & Baumgartner, 1994), which will also be considered in the present chapter.

For all constructs, the first reference points will be the AVE and the Fornell-Larcker criterion. For instance, the AVE of self-efficacy and difficulty was more problematic than the composite reliabilities, so that it seems more reasonable to reduce potential overlap between both constructs as first priority. Concerning the self-efficacy indicators, no clear decision for deletion can be made since most items either have cross-loadings or low indicator reliabilities. The indicators s1 and s5 have occasionally weak loadings and moderate cross-loadings with difficulty (see Section 6.5.2). This may also be the reason for the small AVE < .5. While omitting only s1 decreases the AVE and composite reliability even further at all measurement occasions, the omission of both s1 and s5 together increases the AVE above the cutoff (except for t5 with .489), which suggests that both indicators together may have confounded the self-efficacy construct with aspects that might rather be difficulty-related. Even though s3 had an indicator reliability < .40, its wording is highly relevant for self-efficacy (“I can learn statistics”), so that the removal of s1 and s5 will be considered first to see whether it positively affects the low standardized coefficient of s3 (see Section 7.2.1).

For difficulty, the very low AVEs < .35 indicate that the indicators only explain a small portion of the variance of the latent construct. Three difficulty indicators have a very low indicator reliability and cross-loadings (d2, d5, d6). As mentioned in Section 6.5.2, these indicators differ from the other three in that they assess common preconceptions about statistics rather than participants’ own personal appraisals of difficulty. These indicators also had reliabilities below .40 in a range of other studies using ordinal CFA instead of parceled CFA (Hommik & Luik, 2017³; Persson et al., 2019; Shahirah & Moi, 2019; Vanhoof et al., 2011; Xu & Schau, 2019). Persson et al. (2019) provide a reasonable explanation for the inappropriate matching of d5 (“Statistics is highly technical”) to the difficulty construct: in the 21st century, technology and computers are no longer deemed unfamiliar or difficult, and students in particular are comfortable using computers. As all three indicators performed similarly poorly in all mentioned studies, it can be assumed that the formulation of the indicators, rather than individual sample characteristics of the present study, is responsible.

3

Hommik & Luik (2017) were even more rigorous by also deleting the items “Learning statistics requires a great deal of discipline” and “Statistics involves massive computations” (d3 and d4). These indicators might have been inappropriate in their study due to the population of secondary students whose perceptions about statistics could be even vaguer than those of post-secondary students. Accordingly, both indicators function well in the present study considering their high indicator reliability and adherence to the simple structure.

Moreover, the difficulty indicators yielded very low inter-item correlations, which might suggest an underlying formative concept of the construct. Since formative constructs need to influence other variables to be identified in Mplus, a formative and a reflective model were compared that included only the three difficulty constructs and the quiz scores. With all six original items, the formative structure had a much better fit than the reflective structure (formative model: CFI = .976, TLI = .966, RMSEA = .026, SRMR = .032; reflective model: CFI = .850, TLI = .828, RMSEA = .053, SRMR = .067). These criteria suggest that the assumption of a formative model fits the data better. In the context of this study, however, formative modeling is hardly feasible as additional specifications have to be made for identification in the scope of measurement invariance testing, multiple group analyses, and endogenous positioning (Diamantopoulos & Papadopoulos, 2009). These additional specifications for each of the three (or six) indicators at three measurement occasions would make the model considerably more complex and affect the consistency with the other reflective constructs. Several Monte Carlo simulations on misspecified models suggest that a misspecification of an actually formative measurement model as reflective may inflate unstandardized structural parameters (Jarvis et al., 2003; MacKenzie et al., 2005, p. 728). However, both approaches were also found not to cause considerable differences in the significance of the pathways and thus did not lead to inference errors (Hagger-Johnson et al., 2011; MacKenzie et al., 2005, p. 721). A simulation study by Chang et al. took another path by using samples from a population in which formative and reflective specifications fitted equally well (2016, p. 3178). The results suggested that the structural relationships between both modeling approaches were quite consistent when the research measure is conceptualized as reflective and fits the data (Chang et al., 2016, p. 3184). In that regard, a study by Eberl and Mitschke-Collande (2006, p. 48) corroborates that misspecifications do not seriously affect structural parameters as long as there are no other estimation problems and if the researcher is mainly interested in the relations between latent constructs rather than in their weighting. Hence, while acknowledging that different nuances of some constructs (e.g., difficulty, learning-related emotions) suggest a formative nature of the construct, the reflective operationalization of difficulty as intended by the scale constructors will be adhered to (Schau & Emmioglu, 2012).

Omission of the three indicators contributes to a considerable increase of the AVE. However, the AVE still remains below the threshold of .5. Deletion of the items significantly improved goodness-of-fit⁴. The RMSEA is appropriate as it is nearly below .05 and its p-value is not significant, so that the H0 (RMSEA ≤ .05) is not rejected. Despite these improvements, CFI, TLI, and SRMR are still below the commonly accepted cutoffs. This issue is taken up again after the refinement of the value constructs by means of indicator-specific factors.
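The size of the AVE increase after omitting d2, d5, and d6 can be roughly approximated from values already reported. The sketch below takes the difficulty AVEs from Table 6.14 and the indicator reliabilities from Table 7.2. It rests on two assumptions that are not stated in the study: in a fully standardized solution the AVE equals the mean indicator reliability, and the loadings of the remaining three indicators stay unchanged after deletion (a re-estimated CFA would generally shift them). The function name is hypothetical.

```python
def ave_after_omission(ave_full, n_indicators, omitted_reliabilities):
    """Approximate AVE after dropping indicators, holding other loadings fixed.

    In a standardized solution the AVE equals the mean indicator reliability,
    so the summed reliability of all indicators is ave_full * n_indicators.
    """
    remaining_sum = ave_full * n_indicators - sum(omitted_reliabilities)
    remaining_n = n_indicators - len(omitted_reliabilities)
    if remaining_n <= 0:
        raise ValueError("at least one indicator must remain")
    return remaining_sum / remaining_n

# Difficulty: AVE of the six-item factor and reliabilities of d2, d5, d6.
difficulty = {
    "t1": (0.294, [0.11, 0.13, 0.12]),
    "t5": (0.314, [0.15, 0.09, 0.17]),
    "t9": (0.355, [0.25, 0.14, 0.24]),
}
approx_ave = {
    occasion: ave_after_omission(ave, 6, omitted)
    for occasion, (ave, omitted) in difficulty.items()
}
for occasion, value in approx_ave.items():
    print(f"{occasion}: approximate AVE after omission = {value:.3f}")
```

Under these assumptions the approximate AVE rises from about .29 to .36 to roughly .47 to .50, which is in line with the reported considerable increase that still does not clearly exceed the threshold of .5. The exact values after re-estimation are reported in Section 7.2.1 of the study.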

7.1.2

Value Constructs

Analogous to the previous chapter, value constructs will be checked while using problematic indicator reliabilities and cross-loadings as a starting point in Table 7.3. Since the effort construct contains three indicators only, no items will be omitted to guarantee model identification and because these low values only occur for one single measurement occasion per indicator.

Table 7.3 Problematic Indicator Reliabilities and Cross-Loadings of the Value Models

      CL    t1    t5    t9
a1    -    .23   .35   .33
a2    0    .35   .47   .44
a6    -    .30   .18   .25
i1    +    .39   .38   .39
v1    -    .32   .35   .39
v5    -    .16   .14   .19
v7    -    .17   .17   .18
v8    0    .43   .42   .44

Notes. CL = cross loading; + = indicator has a simple structure at all occasions; 0 = indicator only has minor cross-loading, but a weak factor loading at least at one occasion; - = cross-loading with a difference less than .20 or higher than .40, or factor loading < .40

The affect construct has two problematic indicators, a1 and a6, with low indicator reliabilities. They deal with the enjoyment of mathematics and statistics (“I like statistics/mathematics”) and cross-load on the interest construct. A5 has a cross-loading at t5, and after deletion of a1 and a6, its indicator reliability drops below .40 at all measurement occasions. It may be that the psychological manifestation of frustration mentioned in a2 differs from those conveyed by the other indicators (anxiety, stress). More concretely, anxiety and stress could be seen as more spontaneous, affective reactions, which may lead to frustration if they are left out of account or unredeemed. This indicator may therefore also be a candidate for removal. Deletion of these indicators shifts more explanatory power to a4, so that the construct seems to incorporate appraisals of physiological statistics-related anxiety rather than frustration or positive emotions towards the subject. The three indicators would also be dispensable because their content is also covered by the enjoyment (a1, a6) and hopelessness (a2) constructs.

The indicator reliability of i1 is slightly below the lower bound of .40, which might be due to its pedagogical notion (“interest to communicate statistical information”), while the other indicators focus on self-interest. Its indicator reliability is also considerably smaller than those of the other three items.

As mentioned in Section 6.5.2, v5 and v7 are the most problematic indicators of the value construct due to their cross-loadings and very low indicator reliabilities. Indicators v1 and v8 have just acceptable values and are considered second-priority candidates for deletion. Since composite reliability has not been problematic for the value constructs, the AVE will again serve as a starting point for a decision. Omission of v5 and v7 considerably increases the AVE, which however remains slightly below .5. For further consideration, it also has to be borne in mind that the Fornell-Larcker criterion for I5-V5 and I9-V9 was not fulfilled (see Section 6.6.3), so that the AVE should be sufficiently high to exceed this inter-construct correlation. Therefore, v1 and v8 have to be considered for removal as well. It could be that v1 and v8 do not discriminate well enough between intrinsic and extrinsic motivation because of their very generic formulation without a study-related or professional context (i.e., v1: “statistics is worthless” and v8: “irrelevant in life” could be appraisals that are ascribed to either intrinsic or extrinsic motives). The item v1 will be considered for removal first because it seems to perform worse than v8 (lower reliability, stronger cross-loadings). Removal results in a slight decrease in composite reliability while increasing the AVE beyond the threshold of .5. “Statistics is worthless” might be too generally formulated, while the other indicators refer more concretely to everyday life and study contexts. After removal of v1, the Fornell-Larcker criterion is still not fulfilled, so that the additional removal of v8 has to be considered, because the criterion is then fulfilled (see Section 7.2). The high correlation between interest and value must be kept in mind for further analyses, even though it is comparable in magnitude to those of other studies using CFA (Stanisavljevic et al., 2014; Xu & Schau, 2021). Similar to the difficulty construct, but to a smaller extent, some of the value indicators had low loadings (< .50) in other studies (Hommik & Luik, 2017; Persson et al., 2019; Shahirah & Moi, 2019; Xu & Schau, 2021). In these studies, low loadings also mostly involve the items with general formulations (i.e., “Statistics is worthless”)⁵. To sum up, it seems reasonable to also delete v1 and v8 to improve discriminatory power. Additional deletion of items to further reduce the intercorrelation between interest and value is not considered, to avoid too stark deviations from the original scale.

As for the expectancy constructs, the optimization significantly improved goodness-of-fit. RMSEA and SRMR already fulfill the cutoff criteria, whereas CFI and TLI could still be slightly improved. Another reason for models not fitting the data could be specific variance shares of heterogeneous indicators (Geiser, 2010, p. 101), because measures in research are mostly not perfectly homogeneous (Eid et al., 1999). Therefore, modification indices (M.I.) for residual correlations will be investigated as a final step of optimization for the EV constructs⁶.

4

Changes in model fit, composite reliability and AVE for the expectancy-value constructs will be depicted in Section 7.2.1 after all modifications have been made.

5

The other value indicators focusing on employability, however, also seemed to function better in the present study than in other ones (i.e., “Statistics should be a required part of my professional training” in Persson et al., 2019 and Shahirah & Moi, 2019).

6

Both expectancy-value and emotion-related measurement models were investigated regarding common method bias. χ² difference tests suggest that, in all cases, the more restricted model, in which the common method factor was fixed at zero, fitted the data worse than the unconstrained model with a common method factor. Hence, all measurement models seem to be subject to a degree of method variance. Since the emotion constructs already yielded an adequate fit, method factors will not be modelled to avoid increasing the number of free parameters to the detriment of model parsimony (Weiber & Sarstedt, 2021, p. 406).

7.1.3

Indicator-specific Expectancy-value Effects

Particularly for longitudinal measurements, indicators might carry over method effects due to their repeated usage and share idiosyncratic variance with themselves across time (Geiser, 2010, p. 96). Common method variance may lead to biased inter-construct correlative or causal relations. It also affects the homogeneity assumption of the conventional latent state model, as the indicator would be correlated more highly with itself over time, i.e., with its measurement method, than with the other indicators of the latent construct at the same time (Eid et al., 1999; Geiser & Lockhardt, 2012). There are several reasons for method effects, which can be subsumed under three causes. The first are (1) common rater effects, entailing response tendencies, halo and leniency effects, mood fluctuations, and social desirability tendencies (Bagozzi & Yi, 1991; Weiber & Sarstedt, 2021, p. 398). Such patterns may be aggravated if participants feel observed or think that the answers are individually retraceable. A second cause for common method variance could be (2) item-characteristic effects (Weiber & Mühlhaus, 2014), such as the specific verbal nuances of Likert scales. This entails, for instance, inversely formulated items or items evoking overly negative states of mind, which might foster systematic effects stemming from stronger (dis-)approval tendencies (i.e., floor and ceiling effects; Weiber & Sarstedt, 2021, p. 399). Item-specific effects can also stem from vaguely formulated rating items, which lead participants to project their own idiosyncratic interpretation onto the item (Xu & Schau, 2019; Weiber & Sarstedt, 2021, p. 399). For longitudinal studies, short items in particular might foster memorization effects in favor of over-consistent response patterns (Weiber & Sarstedt, 2021, p. 398). A third source of method bias are (3) contextual factors, if the survey is structured such that it primes study-relevant aspects, which in turn influences the judgmental patterns (Weiber & Sarstedt, 2021, p. 399). Weiber and Mühlhaus add that implicit theories and assumptions about how aspects of the survey should influence one another may resonate in participants’ endeavor to provide consistent response patterns (2014, p. 356).

For the present study, it is assumed that causes (1) and (3) are of minor importance. Response tendencies (1) should, on average, compensate one another given the large sample size. The items were formulated situation-specifically and assessed in concrete learning situations during the course or while learning statistics, so that the actual emotional and motivational states should resonate more strongly in these patterns than tendential effects (see Section 5.2.3). A broader variety of constructs has been assessed to reduce method variance. Response patterns (tendency to check extreme, crisscross, or central answers) amounted to less than 1 % of the whole data and do not seem to translate into purposeful leniency or desirability effects. Contextual factors (3) should also be of less relevance: when the surveys were announced at the beginning of each semester, the test director did not mention the state of research (i.e., statistics anxiety) or that the assessment relates to blended learning, to avoid priming student expectations. Hence, it is assumed that most of the common method variance in this study stems from item-characteristic effects (2) beyond the actual constructs (Xu & Schau, 2021). As has been shown in the prior optimization, this is also reflected in problematic indicators that were vaguely or overly negatively formulated (e.g., “Statistics is worthless”) or unrelatable (e.g., “Most people have to learn a new way of thinking to do statistics”).

To explicitly account for method-specific variance, several methods exist (Geiser & Lockhardt, 2012). The correlated uniqueness approach makes use of auto-correlated residuals over time and treats method effects as part of the error variance. When using this method, the indicator-specific effect remains confounded with random measurement error, thus underestimating the indicator reliability (Geiser, 2013; Eid, 1996). Geiser and Lockhardt (2012) propose a few methods where separate variance components are modelled by means of method factors, which represent the indicator-specificity in contrast to the other indicators. Since measurement error is a threat to the reliability and validity of model-based conclusions, method factors seek to separate the item- and test-specific variance, as a reliable, person-specific, systematic share, from the random measurement error variance (Podsakoff et al., 2003). Such a factor thus captures the share of reliable variance of the specific indicators that cannot be accounted for by the other construct indicators. These latent variables measure indicator-specific, interindividual differences between the measurement occasions. A limitation of traditional method factors is that their meaning in relation to the residuals is not unambiguously assignable to general or specific trait factors, or a method effect. Moreover, the residual factor must not correlate with the original construct on any occasion (Eid et al., 1999), which lacks a theoretical foundation, particularly if two indicators share common method variance (Geiser & Lockhardt, 2012). Therefore, Geiser & Lockhardt (2012) recommend other methods based on more thoroughly defined trait and method factors for all indicators.

However, method factors in the present study will only be modelled for indicators with high residual correlations across time as indicated by M.I., if justifiable as regards content, to avoid over-complexification through the additional parameters (Weiber & Sarstedt, 2021, p. 406) and to remain as close as possible to the previously postulated model while achieving an acceptable model fit. M.I. represent the decrease in χ² values if a fixed or restricted parameter is set free, i.e., included in the model to be estimated (Weiber & Mühlhaus, 2014, p. 245; Bühner, 2021, p. 501). For the modeling of indicator-specific effects, only non-reference indicators should be considered (Geiser & Lockhardt, 2012). Table 7.4 shows all M.I., i.e., the expected reduction in χ², greater than 10 (WITH statements) for the same indicators across time, arranged in descending order. M.I. that are consistently high for one specific indicator across time should be considered for modelling as a method factor. The indicator a5 has the highest M.I. across time (372.10 in total). Second highest is v4 (168.46), followed by e1 (110.25), d3 (80.44), and d1 (70.70)⁷. These findings coincide with those of Xu & Schau (2019, p. 40), who modelled a general factor revealing that 50% of the total variance of the difficulty construct as well as 40–50% of the value and affect construct was due to method effects.

Table 7.4 Modification Indices Greater than 10 in Descending Order

Item 1  Item 2  M.I.      Item 1  Item 2  M.I.     Item 1  Item 2  M.I.
a15     a55     201.03    d11     d51     26.94    i12     i92     13.75
a55     a95     128.89    d53     d93     26.68    s13     s43     13.14
e51     e91     83.13     v52     v92     25.43    [d54]   d94     12.69
v14     v54     65.58     i13     i53     22.31    [v13]   v93     11.96
v54     v94     51.86     e13     e53     21.61    v12     v92     11.56
v54     v14     51.02     [i13]   i93     18.78    s13     s93     11.52
a53     a93     46.64     d13     d93     17.95    a13     a53     11.43
a15     a95     42.21     v12     v52     17.57    e11     e91     11.23
s52     s92     38.14     s43     s93     17.01    v16     v96     11.10
d13     d43     35.81     i14     i54     16.39
d51     d91     33.41     e11     e51     15.89
i12     i52     30.44     [s54]   s84     15.72
i52     i92     27.59     v16     v56     13.75

Notes. M.I. = modification index; [] = item cannot be considered for modification because it functions as reference indicator

Starting with the value constructs, the items v4 and a5 yield the highest residual correlations over time while the M.I. for other items are rather small in comparison. Item a5 (“I am scared by mathematics.”) might have a stronger

7

Lower modification indices have been considered, but the respective indicators did not match to any of the above-mentioned item-characteristics as regards content and formulation. Moreover, model fit already increased considerably without considering more method factors. For learning- and course-related emotions, modification indices suggest, for instance, indicator-specific effects for hL 2 and jC 4 across time. However, no method factors were considered for these emotion indicators because model fit after optimization was already appropriate as it was. Therefore, it was preferred to keep the model less artificial.

7.1 Expectancy-value Indicators Considered for Removal or Optimization

213

method-specific effect since the other indicators are related to statistics. Moreover, Xu & Schau (2019, p. 41) argue that such negatively worded indicators might carry more indicator-specific variance as they are more emotionally charged than the positively worded items (e.g., “I will like statistics”). The item a5 was moreover not a part of the original SATS scale but added to the affect scale in the course of this project and might therefore inherit specific effects that had not been accounted for in prior studies. Item v4 (“Statistical skills will make me more employable.”) is a rather general (and maybe rather agreed-on) preconception that might not be fully responsive to experiences made while attending the statistics course. The answer might also depend on individual career choices, which may result in idiosyncratic answer patterns (Bagozzi & Yi, 1991). Regarding the item e1, there is no intuitive explanation for the higher indicator-specific variance. A potential explanation could however be the translation of the verb to German. While the original scale uses “completing” statistics tasks, the German item refers to “task processing”. It could be that the translation is too vague so that participants project different conceptions of task processing to the item (i.e., completion, having a short look at it, rehearsing etc.). Regarding the difficulty construct, it seems justifiable to assume that both indicators (“Statistics is a complicated subject.” and “Learning statistics requires a great deal of discipline.”) carry over stable test-specific method effects in repeated measurements. Due to their general formulation regarding statistics as a whole subject, they might involve overgeneralized response biases while the reference indicator d4 refers to state-like appraisals about statistics computations and formulae depending on current topics students are dealing with across time. 
The low method variance among the more situation-specific emotion indicators might also indicate that the items targeting more stable attributes, such as expectancy and value beliefs, are more susceptible to carrying method variance. Xu and Schau (2020) also suggest that the “non-I” formulations of some items in the SATS-36, such as d1, d3, and v4, may assess stereotypical beliefs rather than the students’ true attitudes. To sum up, the size of some M.I., together with the specific item formulations, which likely involve item-specific characteristics over time, and the findings of Xu and Schau (2020), renders it justifiable to model method factors for the indicators d1, d3, v4, e1, and a5. The results of these changes will next be compared with the original models.
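The totals per indicator reported above can be traced back to Table 7.4 by summing, for each indicator, the M.I. of all its pairings across measurement occasions. The following Python sketch illustrates this for a5 and v4 only. The label parsing (one-letter construct code, one-digit occasion, item number) and the helper names are assumptions made for illustration; they are not part of the original analysis.

```python
from collections import defaultdict

# (first label, second label, M.I.) as listed in Table 7.4 for a5 and v4
rows = [
    ("a15", "a55", 201.03), ("a55", "a95", 128.89), ("a15", "a95", 42.21),
    ("v14", "v54", 65.58), ("v54", "v94", 51.86), ("v54", "v14", 51.02),
]


def indicator(label):
    """Drop the occasion digit, e.g. 'a15' (a5 at t1) -> 'a5'.

    Assumption: one-letter construct code followed by a one-digit occasion.
    """
    return label[0] + label[2:]


totals = defaultdict(float)
for first, second, mi in rows:
    # Only pairings of the same indicator across occasions are summed.
    if indicator(first) == indicator(second):
        totals[indicator(first)] += mi

# Indicators ranked by their summed M.I. (candidates for a method factor)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
for name, total in ranked:
    print(f"{name}: {total:.2f}")
```

Under these assumptions, v4 sums to 168.46 as reported, while totals recomputed from the rounded table entries may deviate from the reported ones in the second decimal place.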


7.2 Evaluation of the Modified Expectancy-value Constructs

7.2.1 AVE, Composite Reliability, Fornell-Larcker, and Factorial Structure

Table 7.5 compares the AVE and composite reliabilities of the original and new measurement models for the expectancy constructs. For composite reliability, M1 is omitted as the values equal those of M2.

Table 7.5 Comparison of the Original and Modified Expectancy Models regarding Composite Reliability and AVE

           S1     S5     S9     D1     D5     D9
AVE M0    .441   .438   .515   .294   .314   .355
AVE M1    .507   .489   .590   .468   .495   .496
AVE M2    no modification      .508   .583   .549
CR M0     .796   .795   .841   .691   .712   .674
CR M2     .750   .738   .811   .709   .805   .784

Notes. M1 = minus s1, s5, d2, d5, d6; M2 = additional method factors for d1 and d3; CR = composite reliability
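For orientation, AVE and composite reliability are commonly computed from the standardized loadings of a construct's indicators. The sketch below uses these standard formulas under the assumption of uncorrelated residuals with variance 1 − λ². Whether the values in Table 7.5 were obtained in exactly this way is an assumption, and the loadings in the example are hypothetical.

```python
def ave(loadings):
    """Average variance extracted: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """Composite reliability for standardized loadings.

    Assumes uncorrelated residuals, each with variance 1 - loading**2.
    """
    total = sum(loadings)
    error = sum(1 - l ** 2 for l in loadings)
    return total ** 2 / (total ** 2 + error)


# Hypothetical three-indicator construct (not taken from the study)
example = [0.70, 0.70, 0.70]
print(f"AVE = {ave(example):.3f}")
print(f"CR  = {composite_reliability(example):.3f}")
```

The example also shows why a construct can reach a composite reliability above .70 while its AVE stays just below .50: reliability rises with the number of indicators, whereas AVE is an average per indicator.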

The composite reliability of difficulty improves to a value above .7, even though half of the scale was removed. Composite reliability for the modified difficulty construct surpasses the reliability found for modified models in other studies (e.g., Emmioglu et al., 2018; Stanisavljevic et al., 2014). This particularly applies to the study of Persson et al. (2019), which removed similar indicators, but where difficulty still did not reach acceptable reliability. For self-efficacy, composite reliability slightly decreases, but still remains clearly above the threshold of .7. Omission of the indicators (s1, s5, d2, d5, d6) results in improved, acceptable values for AVE, so that removal of these indicators should be considered to avoid high residual inter-correlations between self-efficacy and difficulty when they are fixed at zero in the CFA. For D1, D5, D9, the threshold of .50 is only surpassed after modeling the two method factors for d1 and d3 (M2). Table 7.6 shows that the modifications also contribute to lower inter-correlations between self-efficacy and difficulty. For instance, the correlation between D5 and S5 decreases from –.484 to –.155 and the difference to the root AVE of D5 becomes more pronounced in accordance with the Fornell-Larcker criterion (difference of .076 for M0 vs. .609 for M2).

Table 7.6 Root AVE and Correlations of the Modified Expectancy Constructs with Method Factors

       S1      S5      S9      D1      D5      D9
S1    .712    .472    .483   –.093   –.113   –.029
S5    .472    .700    .677   –.097   –.155   –.229
S9    .483    .677    .769   –.089   –.094   –.139
D1   –.093   –.097   –.089    .713    .424    .442
D5   –.113   –.155   –.094    .424    .764    .615
D9   –.029   –.229   –.139    .442    .615    .741

Note. The diagonal contains the root AVE of each construct. The correlations relevant for the comparison are the cross-construct correlations at the same measurement occasion.
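The Fornell-Larcker comparison described above can be retraced with a few lines of Python. The function name is hypothetical. The inputs are the AVE values of D5 from Table 7.5 and the D5-S5 correlations quoted in the text.

```python
import math


def fornell_larcker_margin(ave_value, correlation):
    """Root AVE of a construct minus its absolute correlation with another construct.

    A positive margin means the Fornell-Larcker criterion holds for this pair.
    """
    return math.sqrt(ave_value) - abs(correlation)


# D5 vs. S5 in the original model (M0) and in the model with method factors (M2)
margin_m0 = fornell_larcker_margin(0.314, -0.484)
margin_m2 = fornell_larcker_margin(0.583, -0.155)

print(f"M0 margin: {margin_m0:.3f}")
print(f"M2 margin: {margin_m2:.3f}")
```

Rounded to three decimals, the two margins correspond to the differences of .076 (M0) and .609 (M2) reported above.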

To verify whether the omission of the indicators leads to a more clear-cut factor solution, another EFA is conducted with the remaining items.⁸ The parameter estimates suggest that the 2-factor model has a good fit. The 2-factor solution at t9 yields an insignificant χ² test, implying that the model does not fit significantly worse than the unrestricted correlation model and that the factor model does not have to be rejected based on the collected data. Table 7.7 provides the GEOMIN rotated factor loadings. It shows that there is a clearer simple structure with the alternate selection while the items also adhere more strongly to the criteria that were mentioned earlier (i.e., cross-loading difference > .20, loading on the designated factor > .40). Concerning the indicator reliabilities⁹ of the expectancy constructs after optimization, d11 surpasses the lower bound of .40, while none of the other remaining difficulty indicators exhibits a remarkable change. For self-efficacy, the indicator reliability of s2 decreases slightly while more explanatory power shifts to s3 and s4. Particularly the originally questionable loading and reliability of s3 (see Section 7.1.1), which was not deleted because of its content-related relevance, improved significantly after optimization, signaling a consolidation of the construct on self-efficacy appraisals.

8 The model fit for the theoretically assumed factor solutions of the modified models will be omitted for reasons of space but can be found in Appendix 6 in the electronic supplementary material.

9 For reasons of clarity, indicator reliabilities will be tabulated together in subchapter 7.5 while only the ones that changed considerably will be discussed in this subchapter.

Table 7.7 GEOMIN Rotated Loadings for the Modified Expectancy Constructs with Method Factors

       S1      S5      S9      D1      D5      D9
s2    .579    .514    .627
s3    .694    .782    .799
s4    .837    .786    .796
d1   –.221   –.206            .520    .687    .636
d3                            .746    .692    .777
d4                            .754    .728    .684

Notes. Loadings between –.2 and .2 are left blank.

Table 7.8 presents the AVE and composite reliability comparisons of the original and the modified value models.

Table 7.8 Comparison of the Original and Modified Value Models regarding Composite Reliability and AVE

           A1     A5     A9     I1     I5     I9     V1     V5     V9     E1     E5     E9
AVE M0    .421   .403   .422   .603   .573   .627   .393   .396   .416   .467   .580   .488
AVE M1    .588   .551   .574   .676   .651   .712   .548   .559   .559   no modification
AVE M2    .620   .660   .627   no modification      .574   .583   .590
CR M0     .809   .798   .805   .857   .842   .869   .832   .832   .844   .852   .719   .802
CR M2     .827   .849   .832   .862   .848   .881   .842   .845   .848   no modification

Notes. M1 = minus a1, a2, a6, v1, v5, v7, v8, i1; M2 = additional method factors for a5 and v4

Deletion of the three indicators a1, a2, and a6 leads to an increase of the AVE of affect above the cutoff of .50. The deletion of the four value indicators likewise increases the AVE of value above the threshold of .50 (M1). Modelling the three method factors further contributes to a considerable increase of the AVE for both constructs. The increase in AVE may be an indication of the decreased method variance, which translates into a higher convergent validity (Geiser & Lockhardt, 2012). Even though some variables have been deleted, the modifications resulted in an increase in composite reliability in all cases. Removing i1 also seems justifiable because it does not decrease composite reliability, but considerably increases the AVE of interest (M1). The removal thus further increases the discriminant validity between value and interest in terms of the Fornell-Larcker criterion, which was shown to be in particular need of improvement in Section 6.6.3. Table 7.9 shows the correlation matrix of the modified version of the value constructs (M2) with the root AVE on the diagonal to evaluate the Fornell-Larcker criterion.

Table 7.9 Root AVE and Correlations of the Modified Value Constructs with Method Factors

       I1      I5      I9      V1      V5      V9      A1      A5      A9      E1      E5      E9
I1    .822    .563    .519    .615    .422    .392    .225    .220    .187    .318    .064    .091
I5    .563    .807    .682    .421    .661    .492    .105    .270    .305    .267    .350    .244
I9    .519    .682    .844    .381    .484    .716    .050    .267    .373    .220    .174    .246
V1    .615    .421    .381    .758    .576    .564    .164    .165    .114    .194    .011    .057
V5    .422    .661    .484    .576    .764    .708    .058    .221    .186    .155    .148    .118
V9    .392    .492    .716    .564    .708    .768   –.034    .157    .232    .139    .060    .178
A1    .225    .105    .050    .164    .058   –.034    .787    .535    .452   –.176   –.199   –.145
A5    .220    .270    .267    .165    .221    .157    .535    .812    .682   –.092   –.065   –.084
A9    .187    .305    .373    .114    .186    .232    .452    .682    .797   –.149    .004    .084
E1    .318    .267    .220    .194    .155    .139   –.176   –.092   –.149    .812    .508    .415
E5    .064    .350    .174    .011    .148    .060   –.199   –.065    .004    .508    .683    .770
E9    .091    .244    .246    .057    .118    .178   –.184   –.145   –.084    .415    .770    .762

It can be seen that the Fornell-Larcker criterion is now fulfilled as the root AVE of V5 and V9 are greater than the correlation between I5-V5 and I9-V9, respectively. The criterion is reached only after additional deletion of the fourth questionable indicator v8 (see Section 7.1.2), so that the difference between the root AVE of V9 and the correlation between I9-V9 increases to .028.¹⁰ Modelling the additional method factor for v4 increases this difference to .052. This is an indication of an improved discriminant validity for M2. Exploratory factor analysis for the modified version shows that the fit of the theoretically assumed 4-factor solution has improved considerably (see Appendix 6 in the electronic supplementary material). Regarding the 5-factor solution, there is only a very small relative increase in model fit and the additionally extracted fifth factor contains one item only (a3). Table 7.10 shows the rotated matrix for the 4-factor structure.

10 The difference of .028 cannot be inferred from Table 7.9 because it represents the final modification M2, not M1.

Table 7.10 GEOMIN Rotated Loadings for the Modified Value Constructs with Method Factors

I1

V1

E1

A1

I5

V5 .247

E5

A5

I9

i2

.820

.536

i3

.815

.877

.867

i4

.808

.817

.820

.196

V9

.580

.605

.589

v3

.905

.917

.918

v4

.722

.689

.682

v6

.648

.628 .714

e2

.942

e3

.755

.578 .652

.257

A9

.732

v2

e1

E9

.566

.868

.980

.409

.687

a3

.619

.608

.595

a4

.973

.943

.941

a5

.665

.624

.646

Notes. Loadings between –.2 and .2 are left blank. Values in bold indicate that the item assignment conforms to the theoretical factor.

The revised solution more closely adheres to Thurstone’s idealized “simple structure”. One remaining, problematic loading is that of e53, which cross-loads on interest. As this indicator performs adequately at the other occasions, the indicator will not be considered for removal. For expectancy and value, the modifications led to a homogenization of the indicators within their designated construct.
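The two criteria used throughout this chapter to judge simple structure (loading on the designated factor > .40, difference to the highest cross-loading > .20) can be expressed as a small check. The following Python sketch is an illustration only. The function name and the example loadings are hypothetical and not taken from the tables.

```python
def has_simple_structure(loadings, target, min_loading=0.40, min_gap=0.20):
    """Check one item against the two criteria named in the text.

    loadings: dict mapping factor name -> rotated loading of the item
    target:   name of the theoretically designated factor
    """
    primary = abs(loadings[target])
    # Highest absolute loading on any non-designated factor (0 if none)
    cross = max((abs(v) for k, v in loadings.items() if k != target), default=0.0)
    return primary > min_loading and (primary - cross) > min_gap


# Hypothetical items: the first passes, the second cross-loads too strongly
print(has_simple_structure({"E": 0.58, "I": 0.26}, "E"))  # gap of .32
print(has_simple_structure({"E": 0.45, "I": 0.30}, "E"))  # gap of .15
```

An item can fail the check at one occasion and pass at the others, which is the situation described for e53. The check itself does not decide whether such an item is removed.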

7.2.2 Global Goodness-of-fit for the Expectancy-value Constructs

To evaluate whether these construct- and item-specific analyses translated into better model fit, Table 7.11 compares the goodness-of-fit of the EV constructs of the models M0–M2. The models include all three measurement occasions, so that the estimations more closely approximate the later structural models.

Table 7.11 Model Parameters for the Original and Modified EV Constructs

             Expectancy                               Value
             M0           M1           M2             M0           M1           M2
χ²           2,839.96c    607.82c      327.47c        9,338.99c    2,315.6c     1,545.90c
df           480          120          107            1824         636          597
SCF [%]      13.5         18.3         19.3           1.9          13.3         12.8
RMSEA        .057c        .052b        .037           .052c        .042         .032
90% C.I.     .057–.059    .048–.056    .032–.041      .051–.053    .040–.044    .030–.034
SRMR         .083         .070         .049           .083         .057         .046
CFI          .767         .904         .956           .754         .910         .950
TLI          .744         .877         .938           .737         .895         .938
AIC          163,407.47   112,702.04   112,396.19     268,728.93   181,993.36   181,166.85
BIC          167,371.82   115,227.72   114,991.13     271,601.45   186,804.94   183,193.75

Notes. df = degrees of freedom, SCF = scaling correction factor

The fit increased considerably from the models M0 to M1 due to the omission of some indicators, especially considering that the CFI and TLI were initially very low. While models M1 yield acceptable values for RMSEA and SRMR, the CFI and TLI values are still slightly below the recommended thresholds. To avoid further deviation from the original scales through deletion of more indicators, M.I. were consulted as a basis to model method factors. In both cases, these method factors help the M2 models surpass the thresholds. Under the assumption that the indicators are heterogeneous, CFI and TLI of the M2 model with residual factors are above the recommended value of .90 and come close to the upper recommended bound of .95 (Moosbrugger & Kelava, 2020, p. 649). Optimum values should be greater than .95 to minimize the probability of type I and type II errors, while values above .90 are also commonly accepted in literature (Weiber & Mühlhaus, 2014, p. 222). The other goodness-of-fit criteria fulfill the upper-bound criteria. SRMR, RMSEA and its 90% C.I. are below .05. The p-value of the RMSEA is insignificant for all modified models, indicating that the RMSEA of the population is not greater than .05. With each optimization, the AIC and BIC decrease, suggesting that the M2 models should be selected. Hence, the EV models will not be further modified. As the EV constructs of the SATS share the common framework of the EV model, the factor structure of the modified components will be analyzed together to particularly check for the high correlations between self-efficacy and affect that were shown in a variety of studies (see Section 5.3.2).
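The comparison of the models against cutoff values can be summarized programmatically. The sketch below uses the expectancy columns of Table 7.11. The cutoffs of .90 for CFI and TLI and .08 for SRMR follow the text, while the lenient RMSEA cutoff of .08 is a common convention assumed here for illustration. The function and variable names are hypothetical.

```python
# Lenient ("acceptable") cutoffs; the RMSEA value of .08 is an assumption
MAX_CUTOFFS = {"rmsea": 0.08, "srmr": 0.08}   # index must stay below
MIN_CUTOFFS = {"cfi": 0.90, "tli": 0.90}      # index must stay above

# Expectancy models M1 and M2 from Table 7.11
expectancy = {
    "M1": {"rmsea": 0.052, "srmr": 0.070, "cfi": 0.904, "tli": 0.877, "aic": 112702.04},
    "M2": {"rmsea": 0.037, "srmr": 0.049, "cfi": 0.956, "tli": 0.938, "aic": 112396.19},
}


def failed_criteria(fit):
    """Return the names of all fit indices that miss their cutoff."""
    failed = [k for k, cut in MAX_CUTOFFS.items() if fit[k] >= cut]
    failed += [k for k, cut in MIN_CUTOFFS.items() if fit[k] <= cut]
    return failed


for name, fit in expectancy.items():
    print(name, failed_criteria(fit) or "all cutoffs met")

# Among the candidate models, the one with the lowest AIC is preferred
preferred = min(expectancy, key=lambda m: expectancy[m]["aic"])
print("preferred by AIC:", preferred)
```

With these cutoffs, M1 misses only the TLI criterion, whereas M2 meets all four and also has the lower AIC.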

7.2.3 Examination of the Assumed 6-Factorial Structure of the Expectancy-value Constructs

The EV constructs have a strong theoretical grounding in Eccles and Wigfield’s EV theory (2001). To check the fit of the constructs within the overall EV framework, all six modified constructs were submitted to another exploratory factor analysis. The 5-factor solution generates a common factor for self-efficacy and affect, but the model does not fulfill the recommended cutoffs for the TLI and RMSEA. The 6-factor solution comfortably meets all fit criteria, while the 7-factor solution is only a marginal improvement compared to the 6-factor solution (see Appendix 6 in the electronic supplementary material). The latter solution extracts a seventh factor with only one high loading. These findings coincide with other recent studies in which more parsimonious factor structures did not improve model fit meaningfully (Persson et al., 2019; Xu & Schau, 2021). Hence, the theoretically assumed 6-factor solution is considered adequate, and the estimation problems involving a non-positive definite covariance matrix with the original set of indicators (see Section 6.4) did not occur for the optimized solution. Table 7.12 shows the model fit for the final EV solution with the theoretically assumed six factors, fulfilling all relevant cutoff criteria at all measurement occasions. The rotated factor matrix indicates that the factors have a simple structure in most cases. Item e53 (“I plan to study hard for the statistics exam”) cross-loads on D5 and V5, and item i52 on V5. These cross-loadings can be neglected since both items function well at the other two measurement occasions and have appropriate item-specific criteria (i.e., indicator reliabilities). Most problematic is the item s2, which generally has a weak loading and cross-loads on the affect factor at all measurement occasions. It needs to be mentioned that the optimization itself does not seem to be the reason for these cross-loadings because the cross-loadings also occurred with the original set of items.¹¹ Further deletion of s2 is

11 Further sensitivity checks were performed to check whether there is any selection of self-efficacy indicators in which the cross-loadings do not occur. However, in any constellation, at least one indicator loaded on another construct.

.430

.709 .795

d4

.874

.823

.534

I5

.662

.667

.897

.579

.253

V5

.291

.859

.663

E5

.336

.728

.894

.554

A5

.687

.638

.754

–.206

.417

D5

.675

.801

.252

.280

S5

.804

.835

.736

I9

.562

.642

.925

.553

V9

.575

.977

.586

E9

.376

.708

.906

.595

A9

.690

.872

.346

.205

.213

.238

S9

.698

.723

.626

.213

D9

Notes. Loadings between –.2 and .2 are left blank. Values in bold indicate that the item assignment conforms to the theoretical factor.

.464

d3

D1

d1

.830

.266

s2

s4

.617

a5 .667

.977

S1

s3

.585

.714

e3

a4

.936

A1

a3

.736

.651

v6

e2

.702

v4

E1

e1

.905

v3

.775

i4

.565

.798

i3

V1

v2

.834

i2

I1

Table 7.12 GEOMIN Rotated Loadings for the Modified EV Constructs with Method Factors


also not considered because the construct would then only contain two items and not fully represent the original construct as regards content. A possible reason for the cross-loadings between affect and self-efficacy could be the high correlations between these two constructs found in other studies (see Section 5.3.2), often exceeding .90 (Dauphinee et al., 1997; Hilton et al., 2004, p. 104; Nolan et al., 2012, p. 103; Vanhoof et al., 2011; Xu & Schau, 2019, p. 44). For the measurement models with the original number of items, the correlations in the present study were similarly high (approx. .91 for S5 with A5, and S9 with A9). The model optimization contributed to reducing the factor correlation considerably (.554–.656). According to Awang (2014, p. 55), two constructs tend to be redundant or subject to multicollinearity if the correlation exceeds .85. Hence, it cannot be assumed that there is a common factor or hierarchical structure behind both constructs and the moderate correlations still suggest sufficient discriminant validity (Bechrakis et al., 2011).¹² The cross-loadings found between both factors might still be remnants of this underlying overlap and suggest potential redundancy and caveats concerning collinearity, which have to be reconsidered in the later SEM models, in which self-efficacy and affect occur together. Hence, all six factors are kept separate to adhere to the theoretically reinforced structure of the EV model (Chiesi & Primi, 2009).

7.3 Evaluation of the Modified Achievement Emotion Constructs

7.3.1 Achievement Emotion Indicators Considered for Removal or Optimization

Overall, course- and learning-related emotion indicators had fewer issues concerning item- and factor-level criteria. Table 7.13 shows the less appropriate indicator reliabilities and cross-loadings for course- and learning-related constructs.

12 To check whether affect and self-efficacy may be represented by one construct, an exploratory factor analysis was conducted. It yielded highly inappropriate fit indices for a 1-factor solution, and the additional factor of a 3-factor solution only consisted of one item (a5). A hierarchical structure yielded a considerably worse model fit than the models with two unique constructs. In the 5-factor model, the affect and self-efficacy indicators loaded highest on one factor, but the self-efficacy indicators had a loading below .40, so that the six-factor model along with the assumption of distinct statistics attitudes was retained. Therefore, and in order to leave the measurement models in accordance with the analytical model (Section 4.5), both factors will be treated distinctly.


Table 7.13 Problematic Indicator Reliabilities and Cross-Loadings of the Emotion Models

Course-related indicators
         CL     t2     t4     t6     t9
jC 2     0     .42    .46    .46    .35
jC 3           .44    .41    .31    .28
hC 1     0     .44    .48    .53    .49

Learning-related indicators
         CL     t3     t7
jL 2           .15    .19
jL 3           .25    .37
jL 4     0     .41    .54
jL 8     0     .33    .42
hL 1     0     .43    .47

Notes. CL = cross loading; + = indicator has a simple structure at all occasions; 0 = indicator only has minor cross-loading, but a weak factor loading at least at one occasion; — = cross-loading with a difference less than .20 or higher than .40, or factor loading < .40

The common basis for the learning- and course enjoyment indicators which do not function as well as the other ones is the appraisal of an anticipated learning progress to which the emotional state must be ascribed (e.g., “Reflecting on my progress in statistics coursework makes me happy”). For the other indicators, the emotional ascription directly refers to the immediate learning processes and appraisals of concrete activities. Particularly learning-related enjoyment indicators involving prior progress have a low indicator reliability, mostly below .40, as well as cross-loadings, so that they should be considered for removal. Even though jL 4 and jL 8 still have moderately high standardized coefficients compared to jL 2 and jL 3, they function worse than the other indicators. Bhansali and Sharma (2019, p. 37) also identified these indicators as problematic because they are “double-barreled” and actually involve two appraisals that should be treated differently. For example, in the indicator jL 4 (“I enjoy the challenge of learning the statistics material.”), the enjoyment appraisal is contingent upon the appraisal of whether studying the course material is challenging. This indicator performs markedly worse than jL 8 (“I enjoy dealing with the statistics course material.”), which directly bases the appraisal on studying the course material. Due to the suboptimal performance of these indicators and for reasons of consistency, all course- and learning-related indicators referring to self-reflection on anticipated progress should be considered for removal. The respective course enjoyment indicators function slightly better, but particularly jC 3 has low indicator reliabilities at t6 and t9 as well as cross-loadings, as does jC 2 at t9. These items might also be considered for removal to keep the course-related scale consistent with learning-related enjoyment (i.e., without self-reflective aspects). The distributional patterns of jC 2, jC 3, jL 2, jL 3, and jL 8 also suggested a difference from the other indicators due to their strong ceiling effect. The ceiling effect could stem from the fact that “progress makes happy” is a rather undeniable statement, while the other indicators involve the appraisal of happiness in concrete situations. Regarding hC 1 and hL 1, the concept of frustration seems to differ from that of resignation and hopelessness addressed in the indicators 2–5. These two items have considerably worse indicator reliabilities than the other four. The distributional patterns already foreshadowed the differing appraisals since the indicators hL 2–5 and hC 3–5 have a stronger floor effect than hC 1 and hL 1, which may be because frustration is more moderately appraised than hopelessness (see Section 6.5.2). Moreover, the two-factor solution of class-related emotions still has slightly inappropriate values (i.e., RMSEA and SRMR) when both items remain in the models. To keep both learning- and class-related hopelessness consistent and to optimize their fit to the data, these two indicators are considered for removal, too. Concerning the EFA, the theoretically assumed two-factor structure for both course- and learning-related emotions yields adequate fit indices (see Appendix 6 in the electronic supplementary material).¹³

7.3.2 AVE, Composite Reliability, Fornell-Larcker, and Goodness-of-fit

Table 7.14 contrasts the AVE and composite reliability of the original version and the modified version without the problematic indicators. After removal of the above-mentioned indicators, AVE improves considerably for learning- and course enjoyment and moderately for hopelessness. Omission of these indicators led to no remarkable decrease or increase in composite reliability. Due to the high initial reliability of the original solution, the values of the modified solution are still far beyond the recommended cutoffs. Balanced against the high increase of the AVE, the occasional slight decrease of composite reliability (i.e., course hopelessness) is deemed justifiable. To check the impact of the modifications, the goodness-of-fit of the modified measurement models will be reconsidered. Table 7.15 compares the model fit indices of the original and modified versions of the emotion-related measurement models including all measurement occasions.

13 Cross-loadings will not be presented for the modified solution because no substantial ones were found. The rotated factor matrix yields a simple structure for all remaining items. Fornell-Larcker and indicator reliabilities will not be depicted either as these criteria were already adequately fulfilled before modification.

Table 7.14 Comparison of the Original and Modified Emotion Models regarding Composite Reliability and AVE

           J3L    J7L    H3L    H7L    J2C    J4C    J6C    J9C    H2C    H4C    H6C    H9C
AVE M0    .488   .544   .603   .633   .599   .628   .620   .598   .578   .587   .616   .608
AVE M1    .707   .737   .667   .677   .665   .706   .718   .718   .609   .617   .647   .642
CR M0     .878   .902   .882   .895   .902   .921   .918   .910   .870   .881   .900   .896
CR M1     .906   .918   .880   .893   .908   .923   .927   .927   .858   .868   .891   .893

Notes. M1 = minus jL 2, jL 3, jL 8, hL 1, jC 2, jC 3, and hC 1

Table 7.15 Model Parameters for the Original and Modified Emotion Constructs

             Course Emotions                 Learning Emotions
             M0             M1               M0             M1
χ²           5198.41c       1774.16c         262.52c        335.07
df           1052           566              293            98
SCF [%]      15.2           16.8             23.9           23.59
RMSEA        .051           .037             .072c          .040
90% C.I.     .049–.052      .035–.039        .070–.075      .035–.045
SRMR         .081           .059             .102           .036
CFI          .856           .944             .857           .978
TLI          .845           .938             .841           .973
AIC          177,341.074    142,645.20       150,222.207    110,657.594
BIC          182,144.317    146,423.75       153,467.63     112,943.953

Notes. df = degrees of freedom, SCF = scaling correction factor

Omission of the indicators considerably improves goodness-of-fit for the two-factor solution at all measurement occasions. The χ² value is still significant for all original and modified models, which may be due to its sensitivity to the large sample size (Weiber & Mühlhaus, 2014, p. 204). CFI and TLI are close to or slightly greater than the upper-bound recommended cutoff. SRMR is close to the good value of .05 and still acceptable as it is below .08. The RMSEA and its C.I. indicate that the population values are below .05. In sum, the modifications contributed to further improvement of the measurement models so that almost all criteria fulfill the upper-bound thresholds, except for the SRMR, which is however still acceptable.

7.4 Residual Correlations

M.I. also give an insight into residual correlations within the model that should be freed to improve model fit. By default, Mplus assumes that the residual variables εjk are uncorrelated because variability is considered unique to each indicator (Geiser, 2010, p. 134; Landis et al., 2008, p. 200). In some cases, freeing residual correlations might be recommended to account for minor shared effects between two indicators. A high residual correlation suggests that there is a common cause of both affected indicators other than the latent factors, or that they measure more specific aspects of the latent variable. As this common causal variable does not exist in the model, its influence goes into both residuals (Landis et al., 2008, p. 204). Freeing the residuals improves model fit because the neglected cause is then accounted for. A differentiation has to be made between manifest and latent residual correlations. Table 7.16 displays M.I. greater than 20 for residual correlations of manifest items (WITH statements) in descending order. Only residuals of indicators of the same construct and at the same measurement occasion will be considered because it is not recommended to correlate residuals between different measures.

Table 7.16 Modification Indices for Within-Construct Residual Correlations > 20

Item 1   Item 2     M.I.     Item 1   Item 2     M.I.
jC 65    jC 66    101.74     jC 95    jC 96     40.35
jC 25    jC 26     86.09     hL 74    hL 75     40.29
i53      i54       71.41     jC 94    jC 96     36.03
jC 45    jC 46     64.99     jC 24    jC 26     24.96
hL 72    hL 73     60.73     jC 64    jC 66     20.87
e51      e52       44.96

The largest decrease in χ2 would result from freeing residual correlations that belong to indicators of course enjoyment, learning-related hopelessness, effort, and interest. In particular, jC 5 and jC 6 as well as jC 4 and jC 6 have high M.I. at each measurement occasion, while the residual correlations between the hopelessness, effort, and interest indicators only occur occasionally and do not follow a consistent pattern. For jC 4, jC 5 and jC 6, the similar wording related to excitement and tension in learning new statistical content could be a shared cause for the residual correlation. hL 2 and hL 3 might correlate due to their reference to limited cognitive capacities while hL 4 and hL 5 refer to limited endurance. e1 and e2 might correlate because they refer to engagement in ongoing course activities while e3 refers to the summative exam.

Despite such seemingly plausible explanations, and although it is common practice in the majority of SEM publications, freeing residual correlations is disputed for its atheoretical nature, the risk of masking hierarchical structures within the data, and the artificial capitalization on idiosyncratic sample-specific characteristics that might be inapplicable to the population (Landis et al., 2008, p. 194). While freeing a residual correlation accounts for the putative common cause of both affected indicators, this cause is never actually specified, i.e., the causative variable is not restored. Consequently, most residual correlations remain unjustifiable unless the new model is cross-validated with new data to rule out sampling error (Landis et al., 2008, p. 209).

As an implication for this stage of analysis, residual correlations between manifest variables will not be modelled since a good fit to the data has already been obtained without an overly data-driven post hoc modification, namely by omitting problematic indicators based on their reliability and validity (Cole et al., 2007, p. 381). Due to the inconsistent patterns for most sets of variables, larger residual correlations could also stem from chance findings (Kosovich et al., 2014, p. 810). For the later structural models, the magnitude of the residual correlations will be reconsidered in light of their model fit.
Moreover, residual correlations between the endogenous latent variables at the same measurement occasion are fixed at zero per Mplus default. For longitudinal models, these restrictions have to be relaxed because occasion-specific circumstances affect all constructs (Geiser, 2010, p. 134; Muthén, 2002). For instance, students assessing themselves to be less self-efficacious might also tend to feel more hopeless at the same measurement occasion. As these effects would not be sufficiently accounted for under the Mplus default, correlated residuals between latent variables were allowed for in the structural models.¹⁴ Hence, the magnitude and significance of the residual correlations can be checked to evaluate the impact of unconsidered predictors or covariates.

¹⁴ Each model was checked for changes in parameter estimates due to the specified latent residual correlations. Apart from a few negligible changes in the cross-lagged estimates, all significant and non-significant interrelations between attitudinal, emotional, and cognitive outcomes remained the same.

7.5 Final Evaluation and Reconceptualization of the Modified Measurement Models

Table 7.17 summarizes the maintained indicators along with their reliabilities for the upcoming analyses from the optimized models.

Table 7.17 Indicator Reliabilities of the Modified Measurement Models

Item    t1     t5     t9
s2      .35    .31    .44
s3      .45    .53    .64
s4      .73    .64    .71
d1      .41    .64    .58
d3      .58    .67    .65
d4      .65    .67    .56
a3      .43    .42    .46
a4      .86    .84    .82
a5      .63    .89    .69

Item    t1     t5     t9
i2      .66    .58    .73
i3      .66    .68    .73
i4      .72    .70    .70
v2      .46    .56    .57
v3      .76    .80    .78
v4      .67    .72    .78
v6      .47    .33    .31
e1      .55    .60    .74
e2      .87    .66    .85
e3      .61    .33    .55

Item    t2     t4     t6     t9
jC 1    .64    .71    .73    .73
jC 4    .52    .61    .60    .60
jC 5    .77    .79    .80    .82
jC 6    .76    .76    .77    .78
jC 7    .63    .66    .70    .67
hC 2    .29    .43    .54    .47
hC 3    .63    .59    .66    .64
hC 4    .76    .73    .75    .79
hC 5    .75    .76    .74    .81

Item    t2     t7
jL 1    .59    .62
jL 5    .72    .78
jL 6    .79    .77
jL 7    .72    .79
hL 2    .44    .53
hL 3    .67    .68
hL 4    .79    .80
hL 5    .69    .70

It can be seen that most indicator reliabilities are above the recommended value of .5. The items s2, v6, e3, and hC 2 are the only ones with reliabilities below the critical value of .4. These items will not be deleted because they fit in with the meaning of the modified construct, are not affected by sizable cross loadings, and have an acceptable reliability on at least one measurement occasion. The modified measurement models for expectancy and value, along with the modelled residual factors, are presented in Figure 7.1 and Figure 7.2; those for course- and learning-related emotions follow in Figure 7.3 and Figure 7.4.

Figure 7.1 Modified Expectancy Measurement Models

Figure 7.2 Modified Value Measurement Models. (Notes for Figure 7.1 and Figure 7.2. Reference indicators are underlined and bold. For reasons of clarity, the correlations between the indicator-specific factors and the latent constructs are not depicted. Correlations between the indicator-specific factors are depicted if they were significant at the .05 level)

The indicator-specific effects were not assumed to be orthogonal. Thus, they were allowed to correlate with each other according to the indicator-specific trait approach (Geiser & Lockhart, 2012, p. 260) and also with the other non-reference constructs (Geiser, 2010, p. 144). Loadings on the indicator-specific factors are all significantly different from zero, so that there are indicator-specific effects that were not accounted for in the former model (Geiser, 2010, p. 102). Squared standardized loadings suggest that the method factors account for one fifth of the variance of v4 and d3, one third of the variance of d1 and e1, and half of the variance of a5. According to Geiser (2010, p. 103), this indicates that the indicators are rather heterogeneous since a high share of the observed variance stems from indicator-specificity. The low correlations between the indicator-specific factors indicate that they share only a small extent of common specificity. While the standardized loadings of the respective indicators on the latent constructs are acceptable, ranging from .60 to .70 (except for d11), the indicators also load considerably on the indicator-specific factors, with loadings ranging from .40 to .50 and the greatest loading being .71. Hence, in terms of size, the indicator-specific effects can hardly be neglected (Geiser, 2010, p. 105). Figure 7.3 and Figure 7.4 show the course and learning emotion measurement models, which do not have residual factors.
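The variance shares quoted above follow from squaring standardized loadings. The following minimal sketch illustrates this arithmetic; the function name is hypothetical, and apart from the largest reported loading of .71, the example loadings are illustrative values chosen from within the reported ranges rather than estimates taken from the models.

```python
def variance_share(std_loading: float) -> float:
    """Share of an indicator's variance attributed to a factor.

    For a standardized loading, this share is simply the squared loading.
    """
    if not -1.0 <= std_loading <= 1.0:
        raise ValueError("a standardized loading must lie in [-1, 1]")
    return std_loading ** 2


# Largest loading on an indicator-specific factor reported in the text
print(round(variance_share(0.71), 2))  # roughly one half of the variance

# Illustrative loadings (assumed, within the reported .40-.50 and .60-.70 ranges)
print(round(variance_share(0.45), 2))  # roughly one fifth of the variance
print(round(variance_share(0.65), 2))
```

Read this way, a loading of .71 on an indicator-specific factor corresponds to about half of the indicator's variance, which matches the share reported for a5.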

Figure 7.3 Modified Course Emotion Measurement Models

Panels: Course Enjoyment at T2, T4, T6, and T9, each measured by jc1, jc4, jc5, jc6, and jc7.

Taking a closer look at the standardized loadings, most indicators with weak loadings show them at only one measurement occasion (such as e53, hC 22). For most indicators, the standardized loadings yield similar values across time, with some exceptions, such as e53 having a considerably lower loading than e13 and e93. The four emotion constructs in particular have factor loadings that hardly change across time, except for hC 22. The stability of most factor loadings across measurement occasions supports the assumption that they are time invariant (Geiser, 2010, p. 103). The picture for the expectancy and value constructs is fuzzier: the reference indicators seem to be stable across time, but some other indicators differ more strongly, such as s2, s3, d1, d3, v6, and e3. Time invariance is a necessary requirement for latent mean comparisons and will be tested more thoroughly in the next chapter.

Figure 7.4 Modified Learning Emotion Measurement Models. (Notes for Figure 7.3 and Figure 7.4. Reference indicators are underlined and bold)

Panels: Learning Enjoyment at T3 and T7, each measured by jl1, jl5, jl6, and jl7; Learning Hopelessness at T3 and T7, each measured by hl2, hl3, hl4, and hl5.

The highest autocorrelations occur between the penultimate and final measurement of the constructs, ranging from .6 to .8, while the correlations between the first and final measurement are lower and range from .4 to .6. The higher value of neighboring intercorrelations suggests that the interindividual differences in the EV appraisals consolidate following an autoregressive structure over the course of the semester and might transition into more trait-like appraisals (Geiser, 2010, p. 106; Pekrun et al., 2010, p. 46). A latent state-trait modelling of the factors reveals that differences in the stable trait factors account for a substantial share of variance within the state factors. Stable dispositions assessed at the beginning of the semester account for one third to half of the variance of the corresponding states, whereas trait-like appraisals have the strongest explanatory power in the middle and at the end of the semester, ranging from 59% to 82%. Conversely, two thirds to half of the variance at the beginning and one third to one fifth of the variance at the end of the semester could still be attributable to time-specific influences. Hence, stronger variation and malleability of the appraisals is expected during the first half of the semester and less volatility in the second half.¹⁵

As the removal of indicators from validated scales can be seen as a potential threat to content and construct validity, Table 7.18 documents the revised construct connotations based on the remaining indicators. In all, indicators with over-generalized, prejudiced formulations (i.e., difficulty, value) which had no direct reference to students’ personal self-assessment in a sufficiently concrete motivational or emotional context have been removed on grounds of their reliability, cross loadings, and their impact on overall construct performance. The revised constructs have in common that they refer more directly to personally perceived, contextual motivational and emotional states within a statistics course. With regard to the original EV dimensions (see Section 3.2.2), the modified value construct exclusively measures utility value since the more general indicators relating to personal attainment value were deleted. Two of the deleted difficulty indicators involve social comparison, so that they may belong to academic self-concept rather than to perceptions of task demands in the narrower sense (Bong & Skaalvik, 2003, p. 9; Marsh et al., 2019). Affect is a construct that was newly composed for the SATS-M and was intended to account for statistics-related anxiety and enjoyment value (Ramirez et al., 2012).
The removal of the notions of affective enjoyment is compensated for by the course- and learning-related enjoyment constructs of the achievement emotion questionnaire in the structural analyses. A limitation of such post hoc modifications, or specification searches, is their data-driven, exploratory nature, which means that they might not generalize to the population (Landis et al., 2008, p. 194). However, at least for the SATS, a limited number of studies used unparcelled CFA and documented challenges similar to the present findings (e.g., Hommik & Luik, 2017; Shahirah & Moi, 2019; Vanhoof et al., 2011). This serves as a first indication that the pinpointed potentials for optimization are not entirely due to sampling error. The optimized measurement models will serve as the basis for all further analyses, beginning with the analyses of time invariance.

¹⁵ The volatility of the dispositions and the state- or trait-like nature of the assessed constructs will be investigated further in the upcoming analyses of mean value changes and the modelling of autoregressive paths within the structural models.


Table 7.18 Comparison of the Original versus Revised Construct Meanings

Statistics-related self-efficacy
• Modification: Deletion of indicators that combine self-efficacy and difficulty appraisals.
• Revised meaning: Self-assessed convictions about the ability to understand statistical subject matter.

Statistics-related difficulty
• Modification: Deletion of indicators relating to common preconceptions about statistics as an overarching concept from a third-person perspective.
• Revised meaning: Subjectively perceived difficulty of statistics from the individual point of view.

Interest: statistics-related interest value
• Modification: Deletion of one indicator related to the interest of teaching/conveying statistical content.
• Revised meaning: Interest in studying statistics to understand and apply it for one’s own purposes.

Value: statistics-related utility value
• Modification: Deletion of indicators containing overgeneralized opinions that related to attainment value.
• Revised meaning: Self-assessed usefulness of statistics for one’s professional training and career.

Affect: statistics-related affect and anxiety
• Modification: Deletion of indicators relating to statistics- and math-related enjoyment and frustration.
• Revised meaning: Assessment of statistics-related affective reactions of anxiety and stress.

Statistics-related effort
• Modification: No change.

Course enjoyment
• Modification: Deletion of indicators relating to enjoyment reflecting about an alleged future learning success.
• Revised meaning: Enjoyment related to activities currently taking place (i.e., participation in the course, listening to the professor).

Learning enjoyment
• Modification: See above (course enjoyment).
• Revised meaning: Enjoyment related to activities currently taking place (i.e., studying the course material).

Course hopelessness
• Modification: Deletion of indicators relating to less concrete frustration rather than causative resignation/hopelessness.
• Revised meaning: Causal resignation related to participation in the statistics course (i.e., lack of understanding and concentration).

Learning hopelessness
• Modification: Deletion of indicators relating to general affective frustration rather than causative resignation/hopelessness.
• Revised meaning: Causal resignation related to studying statistics in different situations (i.e., lack of motivation, lack of understanding).

8 Results of the Longitudinal Study

8.1 Testing Measurement Invariance Across Time

Before investigating the reciprocal effects throughout the semester, the net change in the mean structure of the constructs will be analyzed. As one of the main research questions relates to the change in students’ AME appraisals during the semester, measurement invariance needs to be tested before the structural analyses. Justifiable and meaningful comparisons of latent state mean values require an equivalent instrument structure on all occasions throughout the semester (Geiser, 2010, p. 126; Murray et al., 2017). Factorial invariance provides evidence that changes in the means and variances over time are true changes, which are not obscured by changes in the way the construct is measured across time (Murray et al., 2017). Configural invariance, which assumes the same item-factor associations across time, was already established in chapter 7. The good-fitting measurement models with six discrete attitudes and four discrete emotions including all measurement occasions will serve as the least restrictive baseline models without additional constraints against which more restricted models will be tested (Byrne et al., 1989, p. 456). For weak factorial invariance, the factor loadings of the same indicators are held constant to test the equivalence of the relationship between the latent factor and its indicators across time. For strong factorial invariance, the factor loadings and intercepts are held equal to permit the comparison of mean differences. Strict factorial invariance additionally requires the equality of the residual variances over time.
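The nested invariance levels described above can be stated compactly. The notation in the following sketch is an assumption made for illustration and is not taken from the thesis: y denotes the observed score of person n on indicator i at occasion t, η the latent state factor, τ the intercept, λ the loading, and ε the residual with variance θ.

```latex
\begin{align*}
y_{nit} &= \tau_{it} + \lambda_{it}\,\eta_{nt} + \varepsilon_{nit}
  && \text{(configural: same item-factor pattern at every } t\text{)} \\
\lambda_{it} &= \lambda_{i} \quad \forall t
  && \text{(weak: equal loadings)} \\
\lambda_{it} &= \lambda_{i},\ \tau_{it} = \tau_{i} \quad \forall t
  && \text{(strong: equal loadings and intercepts)} \\
\lambda_{it} &= \lambda_{i},\ \tau_{it} = \tau_{i},\
  \operatorname{Var}(\varepsilon_{nit}) = \theta_{i} \quad \forall t
  && \text{(strict: additionally equal residual variances)}
\end{align*}
```

Under strong invariance, the expected indicator score is $E(y_{it}) = \tau_i + \lambda_i\,E(\eta_t)$, so that differences in observed means between occasions can only arise from differences in the latent means. This is why strong invariance is treated as the minimum requirement for latent mean comparisons.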

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-658-41620-1_8.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_8


Only a few studies have documented the invariance of the SATS across time (Chiesi & Primi, 2009, p. 311; Hilton et al., 2004, p. 99). These studies only attested invariance of factor loadings (i.e., not of intercepts or residual variances) at two occasions (pre/post) and used the outdated four-factorial instrument (without interest and effort). Moreover, except for Vanhoof et al. (2011), the analyses were based on the parceling technique, which simulation studies found to favor the assumption of actually nonexistent measurement invariance (Meade & Kroustalis, 2006). Evidence for the relevant achievement emotions is more detailed and substantial, with mostly similarly appropriate model fit estimates for strong as well as strict factorial invariance across time (de la Fuente et al., 2020, p. 8; Pekrun et al., 2017, p. 1660; Peterson et al., 2015, p. 89). Only a few studies were found that needed to relax several intercepts to achieve partial strong factorial invariance (Buff, 2014; Putwain, Larkin, et al., 2013, p. 367). Based on these empirical findings, the expectation is that the SATS constructs, without the guise of parcels, might function worse in time-invariant models than the consistently well-performing achievement emotion constructs.

Difference testing and testing for lack of fit are used to evaluate the adequacy of this assumption. Difference testing is based on the χ2 using the Satorra-Bentler scaling correction due to the MLR estimation (Muthén & Muthén, 2011; Satorra & Bentler, 2010). The significance of difference tests is sensitive to small model deteriorations in larger samples, and fit indices vary in their sensitivity to non-invariance of different degrees. Therefore, Chen (2007, p. 501) developed several rules of thumb for model comparison based on extensive simulation studies. Surpassing the recommended change in model fit (see Table 8.1) for N > 300 in relation to the more restricted model would indicate measurement non-invariance.
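To make the logic of scaled difference testing concrete, the following minimal sketch computes a scaled χ2 difference from the scaled χ2 values, scaling correction factors, and degrees of freedom of two nested models. It assumes the widely used original form of the Satorra-Bentler scaled difference, which is not necessarily the strictly positive variant of Satorra and Bentler (2010) cited above. The function name and all input values are hypothetical and are not results of the present study.

```python
def scaled_chi2_difference(t0, c0, df0, t1, c1, df1):
    """Scaled chi-square difference between a nested (0) and a comparison (1) model.

    t0, t1:   scaled (robust) chi-square values of the two models
    c0, c1:   scaling correction factors of the two models
    df0, df1: degrees of freedom, with df0 > df1 (model 0 is more restricted)
    """
    if df0 <= df1:
        raise ValueError("model 0 must be the more restricted model (df0 > df1)")
    # Scaling correction for the difference test
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction")
    # t * c recovers the uncorrected chi-square; their difference is rescaled by cd
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, df0 - df1


# Hypothetical values for a constrained (0) and an unconstrained (1) model
stat, ddf = scaled_chi2_difference(t0=120.0, c0=1.2, df0=50, t1=100.0, c1=1.1, df1=45)
print(round(stat, 2), ddf)
```

The returned statistic is then referred to a χ2 distribution with the returned degrees of freedom. Note that the simple difference of two scaled χ2 values is not itself χ2-distributed, which is why the correction step is needed.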
As RMSEA and SRMR tend to over-reject invariant models only for smaller sample sizes, these indices are used as a reference for the present study, too.

Table 8.1 Magnitude of Model Fit Change Indicating Measurement Non-Invariance

Fit index    Addition of invariance constraints of …
             Loadings    Intercepts    Residual variances
ΔCFI         ≥ −.010     ≥ −.010       ≥ −.010
ΔRMSEA       ≥ .015      ≥ .015        ≥ .015
ΔSRMR        ≥ .030      ≥ .010        ≥ .010
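The cutoffs in Table 8.1 can be applied mechanically to the fit indices of two nested models. The sketch below is one possible way of doing so and uses hypothetical names. It assumes that the CFI cutoff is read as a drop of .010 or more, and it reports each index separately instead of combining them into a single decision. The fit values are illustrative only.

```python
# Cutoffs from Table 8.1 (Chen, 2007; N > 300), keyed by the constraint being added
THRESHOLDS = {
    "loadings": {"cfi": -0.010, "rmsea": 0.015, "srmr": 0.030},
    "intercepts": {"cfi": -0.010, "rmsea": 0.015, "srmr": 0.010},
    "residual_variances": {"cfi": -0.010, "rmsea": 0.015, "srmr": 0.010},
}


def flag_noninvariance(step, less_restricted, more_restricted):
    """Return, per fit index, whether the change in fit reaches the cutoff.

    Both models are dicts with the keys 'cfi', 'rmsea', and 'srmr'.
    Assumption: the CFI cutoff is interpreted as a decrease of .010 or more.
    """
    cut = THRESHOLDS[step]
    # Round to three decimals to avoid floating-point noise in the differences
    delta = {k: round(more_restricted[k] - less_restricted[k], 3) for k in cut}
    return {
        "cfi": delta["cfi"] <= cut["cfi"],
        "rmsea": delta["rmsea"] >= cut["rmsea"],
        "srmr": delta["srmr"] >= cut["srmr"],
    }


# Illustrative fit values for a model with equal loadings and one with equal intercepts
weak = {"cfi": 0.950, "rmsea": 0.037, "srmr": 0.055}
strong = {"cfi": 0.924, "rmsea": 0.045, "srmr": 0.062}
print(flag_noninvariance("intercepts", weak, strong))
```

In this illustrative case only the CFI change reaches its cutoff. This mirrors the situation in which the indices disagree and the decision about invariance has to weigh them against each other.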

If the fit indices are not sufficiently adequate to assume measurement invariance, freeing individual parameters based on the M.I. will be considered to achieve partial invariance, if necessary (Chungkham et al., 2013, p. 3). Before comparing the measurement models, the single reference indicators have to be checked. For these indicators, metric invariance is implicitly assumed because their loadings are fixed at unity to provide a meaningful metric for the scale of the latent variable (Cheung & Rensvold, 1999, p. 7; Murray et al., 2017; Steenkamp & Baumgartner, 1998, p. 81). Non-invariance of the reference indicator might hamper the detection of true (non-)invariance and bias the latent mean estimation (Cheung & Rensvold, 1999, p. 8; Murray et al., 2017). This is why reference indicators were selected that have consistently high loadings on the latent construct across time (see section 14.3) and adequately reflect the whole construct in terms of content. Moreover, as sensitivity checks, fully unconstrained models were compared to constrained models in which only the alleged reference indicator was fixed at unity across time (see Table 8.2).

Table 8.2 Sensitivity Check of the Item-Level Invariance of the Selected Reference Indicators (fit difference of the unconstrained to the constrained model)

Ref. item   jL 5   hL 5   jC 5     hC 3   s4      d4     i2       v3      a4     e2
Δdf         1      1      3        3      2       2      2        2       2      2
ΔSB-χ2      1.79   2.25   12.42c   1.32   4.74a   3.73   11.39c   6.13a   3.77   28.62c
ΔRMSEA, ΔSRMR, ΔCFI, ΔTLI: – or between .001 and .003 for all reference items

The insignificant χ2 difference tests show that for five indicators (jL 5, hL 5, hC 3, d4, a4), the constrained, more parsimonious models with the restricted loadings of the reference indicators across time do not differ significantly from the unconstrained models, pointing to the invariance of the respective indicators. Two more constrained models (s4 and v3) are significant only at the 10% level and three models (jC 5, i2, and e2) are significant at the 1% level. Significant difference tests favor the unconstrained model and would indicate non-invariance. However, the decrease in fit (ΔCFI, ΔRMSEA, ΔSRMR) is < .003 for all models, so that the difference between the models is deemed negligible.¹ Moreover, according to a few simulation studies, the concrete selection of a particular reference indicator does not considerably affect the χ2 statistic of metric invariance models (Klopp & Klößner, 2020, p. 4).

After having investigated the item-level invariance of the reference indicators, the fit indices of increasingly constrained measurement models will be compared following the above-described equality restrictions. The aim is to adopt a model of at least strong factorial invariance as a minimum requirement for latent mean comparisons across time (Geiser, 2010, p. 108; Steenkamp & Baumgartner, 1998, p. 82). Strong factorial invariance ensures that indicator-related mean changes reflect true changes aside from the intercepts. Table 8.3 displays the goodness-of-fit of the different measurement models. Satorra-Bentler χ2 difference testing yields significant p-values < .01 for all degrees of invariance across the EV models. This indicates that further constraints deteriorate the model fit significantly. The course-related model of weak invariance is significant at the .10 level while the learning-related models of strong and strict invariance are insignificant. Particularly for learning-related emotions, the models with more invariance restrictions do not lead to a worse fit to the data than the unconstrained models. The χ2 difference tests thus suggest a better fit of the invariant emotion-related constructs. However, they are sensitive to small discrepancies between the empirical and implied covariance matrices, so that the other fit indices, and their change according to the degree of invariance, are consulted to substantiate the model comparison. For the RMSEA and the SRMR, Chen’s (2007) recommended thresholds are not surpassed for any model at any degree of invariance, suggesting invariance based on these two criteria.
The SRMR for all model groups except learning emotions is greater than the “good” threshold of .06 at the latest when residual variances are equated, but still remains considerably below the upper-bound cutoff of .08. For the CFI, there is a mixed picture for the different model groups. For the expectancy and value constructs, the change in CFI is greater than .010 as soon as the intercepts are set invariant across time, which indicates a lack of strong factorial invariance and conforms to the results of Vanhoof et al. (2011). The differences might partly stem from the fact that three measurement occasions are involved while these studies only assessed pre-post invariance and only included four constructs, respectively. The course- and learning-related model groups, in contrast, only show a very small decrease in the CFI values. For course emotions, the change in CFI from configural to strict invariance is only .005, and for learning-related emotions .003. The CFI values for both model groups also remain close to the good cutoff of .95. The results conform to the homogeneity of the indicator loadings, which was found to be more consistent for the emotion-related constructs than for the EV constructs (see section 7.5). Hence, in accordance with the above-mentioned research findings, the achievement emotion models seem to consistently sustain the assumption of strict factorial invariance while the expectancy and value models have to be considered for revision because of the significant decrease in CFI. The generally worse performance of the EV models compared to the achievement emotion models in terms of measurement invariance could partly stem from unmodelled and temporally differential method variance resonating in the first-mentioned constructs. Such method variance had already been detected for the SATS apart from the present study (Xu & Schau, 2019) and may be a reason for poorly fitting invariance models (Steenkamp & Maydeu-Olivares, 2020).

Whenever attitudes and their change are assessed by means of psychological measures, stricter forms of invariance in particular are seldom fulfilled (Steenkamp & Baumgartner, 1998, p. 81). As Byrne et al. (1989, p. 458) pointed out in a methodological paper, many studies cease to investigate data patterns further once these prove non-invariant. There is, however, the possibility that a share of indicators, i.e., at least more than one, is still invariant (Cheung & Rensvold, 1999, p. 3; Steenkamp & Baumgartner, 1998, p. 81). Explicit modeling of partial invariance assumptions might also be appropriate when substantially evolving behavior of young adolescents is examined (Murray et al., 2017; Pentz & Chou, 1994, p. 451).

Table 8.3 Goodness-of-Fit across Invariance Levels assuming the Unconstrained (Baseline) Model to be Correct

Invariance of: MI 0 = model form; MI 1 = loadings; MI 2 = intercepts; MI 3 = residual variances; MI 2* = partial invariance of intercepts

S, D
Fit index   MI 0            MI 1            MI 2            MI 3            MI 2*
χ2/df       327.47/107c     37.15/119c      512.7/127c      601.66/139c     391.95/125
ΔSB-χ2      –               43.15c          109.13c         119.91c         26.23c
RMSEA       .037            .037            .045            .047            .037
SRMR        .049            .055            .062            .065            .058
CFI/TLI     .956/.938       .950/.936       .924/.908       .909/.899       .947/.935
* [s54], [d14]

I, V, A, E
Fit index   MI 0            MI 1            MI 2            MI 3            MI 2*
χ2/df       1,545.90/597c   1,686.06/621c   2,133.78/639c   2,689.12/665c   1,761.99/637c
ΔSB-χ2      –               132.45c         508.03c         437,82c         343.22c
RMSEA       .032            .034            .039            .045            .034
SRMR        .046            .049            .060            .067            .049
CFI/TLI     .950/.938       .944/.933       .921/.908       .893/.881       .941/.931
* [a13], [e53]

JC, HC
Fit index   MI 0            MI 1            MI 2            MI 3            MI 2*
χ2/df       1,774.16/566c   1,827.18/587c   1,907.08/608c   1,953.24/635c   not necessary
ΔSB-χ2      –               3.71a           67.20c          61.72c
RMSEA       .037            .037            .037            .037
SRMR        .059            .060            .061            .063
CFI/TLI     .944/.938       .943/.938       .940/.938       .939/.939

JL, HL
Fit index   MI 0            MI 1            MI 2            MI 3            MI 2*
χ2/df       335.07/98c      352.49/104c     363.43/110c     391.56/118c     not necessary
ΔSB-χ2      –               16.77b          8.56            43.40
RMSEA       .040            .040            .039            .039
SRMR        .036            .039            .039            .041
CFI/TLI     .978/.973       .977/.973       .976/.974       .975/.974

¹ If a potentially non-invariant reference indicator was detected, the respective models were compared to models in which the other variables function as reference indicators (Byrne et al., 1989). In all cases, the selected reference indicator was already deemed to be the “most invariant” indicator based on the smallest increase in χ2 values, decrease in model fit, and modification indices.
As Steenkamp and Baumgartner (1998) point out, invariance of at least one indicator per factor might be sufficient to assume partial invariance. To test for partial invariance of the EV models, individual M.I. instead of the above omnibus criteria were consulted to find the indicators that cause more severe problems in the model (Steenkamp & Baumgartner, 1998, p. 81). For the expectancy constructs, M.I. for the covariances between the error terms only suggest a marginal expected improvement of 2 at most. Concerning the intercepts, s54 is the only indicator to contribute to a considerable decrease in χ2 of approximately 60, while the other ones are deemed negligible with anticipated decreases of only 2. Allowing the intercept of s54 to vary in the model of strong factorial invariance considerably increases the model fit of MI 2*. The significant χ2 difference test suggests that MI 2*, being the less restrictive model, does not fit the data worse than model MI 2. Comparing MI 2* to MI 1, the decrease in CFI is .003 and thus within the recommended threshold of .01.

Similarly, for the value model constellation, M.I. are consulted to establish an acceptable fit for the model of strong factorial invariance. Freeing the residual covariance with the highest modification index (approx. 70) is once again dispensed with because this index is far below those of the equated intercepts of e53 (257.63) and a13 (91.90). Besides, all other M.I. for the intercepts are again negligible and < 4. Freeing two intercepts is considered tenable due to the high number of parameters of the value model. Again, for the modified model MI 2*, the thresholds compared to model MI 1 are fulfilled as the difference in CFI is .003. RMSEA and SRMR do not deteriorate in MI 2*, and the significant χ2 difference test also indicates that the more restrictive MI 2 fits worse to the data than MI 2*. A tentative, yet speculative, explanation for the non-invariance of the intercepts at the first and second time of administration could be the lack of experience with statistics courses and of stable opinions about the field, leading to time-specific attitude fluctuations, which no longer occur to this extent at the end of the semester.

In sum, measurement invariance testing supports the assumption of partial strong measurement invariance for the EV constructs and strict measurement invariance for the emotion constructs, so that all models can be included in the longitudinal structural analyses. These findings also confirm that the parceling approach of the invariance tests used by Hilton et al. (2004) may indeed have obscured some problematic items, while the well-functioning emotion measurement across time conforms to the studies of, for instance, Pekrun et al. (2017) and de la Fuente et al. (2020).
For this study, it was demonstrated that factor structure, factor loadings, and most, but not all, intercepts as well as residual variances are equal for the SATS and AEQ across all three measurement occasions. For the expectancy and value constructs, the modified models MI 2* of partial strong factorial invariance will be assumed for mean comparisons. The model of strict factorial invariance will not be considered further since the modified models MI 2* are the best-fitting ones with the smallest necessary number of modifications. Further re-specification of the strictly invariant models would entail additional freeing of parameters and increase the capitalization on sample-specific properties (Steenkamp & Baumgartner, 1998, p. 81). As the course- and learning-related constructs adhere to the rules of Chen (2007) and to the recommended thresholds even when residual variances are equated, their strictly invariant versions will be used for the latent mean estimation throughout the semester due to their higher statistical power in terms of parsimony.²

8.2

Average Motivational and Emotional Trajectories throughout the Semester

Having established (partial) measurement invariance based on the optimized models, it is reasonable to estimate and compare the average change in latent means in general, while the determinants of potential trajectories will be further investigated within the inner structural models in the subsequent chapter. The latent means are identified by fixing the intercept of the reference indicators at zero (Wells, 2021, p. 255) to provide a meaningful metric. Table 8.4 indicates the mean development for all assessed constructs.

Table 8.4 Latent Average Development throughout the Semester

Construct  Time  Mean   95% CI lower  95% CI upper  d
S          t1    5.015  4.963         5.066
           t5    4.865  4.791         4.939         …
           t9    4.666  4.588         4.743         …
V          t1    5.181  5.111         5.251
           t5    4.693  4.616         4.771         …
           t9    4.315  4.220         4.410         …
A          t1    4.865  4.786         4.944
           t5    4.726  4.641         4.812         …
           t9    4.531  4.425         4.638         …
D          t1    4.513  4.437         4.590
           t5    4.683  4.615         4.750         …
           t9    4.598  4.511         4.686         …
I          t1    5.337  5.282         5.392
           t5    4.951  4.885         5.017         …
           t9    4.702  4.614         4.790         …
E          t1    5.305  5.249         5.360
           t5    4.189  4.117         4.261         .966
           t9    4.418  4.328         4.508         …
Jc         t2    4.191  4.129         4.253
           t4    3.590  3.522         3.659         .478
           t6    3.749  3.674         3.825         .120
           t9    3.570  3.482         3.657         .129
Hc         t2    1.908  1.847         1.968
           t4    2.368  2.295         2.442         .412
           t6    2.276  2.197         2.355         .077
           t9    2.403  2.312         2.495         …
JL         t3    3.811  3.746         3.876
           t7    3.668  3.597         3.739         .108
HL         t3    2.279  2.211         2.348
           t7    3.044  2.963         3.124         .557

Note. Cells marked … hold d values whose row could not be recovered from the source layout. The values printed for these twelve cells are .152, .182, .166, .160, .074, .166, .253, .084, .200, .117, .178, and .104.

2 Even though strict scalar invariance would only be needed for inferences about observed sum scores, it still serves to strengthen the reliability of the conclusions in the present context (Murray et al., 2017; Steenkamp & Baumgartner, 1998, p. 82).

The constructs are aligned in such a way that high values represent positive AME appraisals (i.e., higher affect represents a more positive affective attitude towards statistics), except for difficulty and hopelessness, for which higher values represent a higher appraised complicacy or frustration, respectively. Similar to the results of Emmioglu et al. (2018, p. 125), Stanisavljevic et al. (2014), and Tempelaar, van der Loeff, et al. (2007, p. 86), the means in general suggest that the attitudes and emotions towards statistics tend to be slightly positive considering the 7-point Likert scale with a neutral value of 4. The trends of the latent means throughout the semester also reveal that AME appraisals on average become more negative. Most become consistently more negative (self-efficacy, interest, value, affect, learning-related hopelessness, and enjoyment). Effort appraisals decrease by the midterm and become more positive again by the end of the semester. The decrease in course enjoyment and the increase in course hopelessness from the beginning to the middle of the semester are associated with a moderate effect size. From the midterm to the second third of the semester, the differences more or less stagnate and then continue an unremarkable negative trend until the end of the semester. Besides, effect sizes show that most effects in latent mean change are negligible. In addition to the changes in course-related emotions at the beginning of the semester and in learning-related hopelessness, the decrease in effort towards the middle of the semester is associated with a strong effect. Apart from these considerable latent mean differences, the other mean attitude levels are rather stable across time. Difficulty seems to be the most stable of all constructs. χ2 difference tests (p = .002) of a model with equal means for difficulty nevertheless suggest that the changes are significant. This construct refers to attitudes about the domain of statistics rather than to students' attitudes about themselves, which taps into a prominent distinction made in attitude research, assuming that the first-mentioned domain attitudes are more resistant to change (Gal & Ginsburg, 1994; Vanhoof et al., 2011, p. 38). The temporal stability might also stem from the fact that difficulty is the only construct that came from the more stable and longer-term EV dimension of goals and self-schemata (i.e., “perceptions of task demands”; see section 3.2.2; Eccles & Wigfield, 2020). Hence, the timeframe of one semester might be too short to represent changes in the perceptions of task difficulty. Further differentiation of the cause-effect chains might also be necessary for more meaningful information on the stability and relevance of each construct, which will be initiated in the following chapters.
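The trends described above can be traced by taking first differences of the latent means. The following sketch does this for three constructs, using the means as read from Table 8.4; the dictionary layout and the function name are illustrative assumptions.

```python
# Latent means as read from Table 8.4 (S = self-efficacy, D = difficulty, E = effort).
latent_means = {
    "S": {"t1": 5.015, "t5": 4.865, "t9": 4.666},
    "D": {"t1": 4.513, "t5": 4.683, "t9": 4.598},
    "E": {"t1": 5.305, "t5": 4.189, "t9": 4.418},
}


def changes(means):
    """Difference between each measurement occasion and the preceding one."""
    occasions = list(means)
    return {
        f"{earlier}->{later}": round(means[later] - means[earlier], 3)
        for earlier, later in zip(occasions, occasions[1:])
    }


for construct, means in latent_means.items():
    print(construct, changes(means))
```

The differences are raw changes on the 7-point scale, not effect sizes; they show the steady decline of self-efficacy, the small movements of difficulty, and the drop and partial recovery of effort.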

8.3

Separate Reciprocal Causation Models

8.3.1

Modeling Approach and Goodness-of-Fits

In consideration of the general average tendency of the trajectories, the best-fitting latent-state measurement models will be extended to latent autoregressive structural models. The hypotheses worked out a priori will thereby be modelled by means of autoregressive, reciprocal, and cross-lagged paths between the respective latent state variables, quiz scores, and exam scores. Autoregressive paths address relations of the same construct across all measurement occasions and serve to control for the temporal relationships within the same constructs (Burns et al., 2020). Reciprocal effects refer to the effects between EV appraisals and emotions, on the one hand, and achievement (quiz and exam scores), on the other hand. Cross-lagged relationships entail relations among chronologically adjacent emotions and EV appraisals only. While these relationships will be controlled for, they will only be mentioned occasionally, as the main focus of the investigation is on the quiz-related reciprocal effects. The cross-lagged effects between motivational and emotional constructs constitute an exception due to investigations of the CV theory of achievement emotions (see section 8.4.2). In a first step, each subgroup of AME appraisals in relation to the formative and summative outcomes over time will be investigated in separate structural models. Under consideration of (non-)significant paths, the insights from the separate models will be integrated into a holistic, yet preferably parsimonious, model that also accounts for the assumptions of the CV theory. Finally, these models will be checked for between-subjects differences according to the heterogeneity criteria (gender, prior knowledge) and the course design.

The structural paths were modelled in such a way that EV appraisals and achievement emotions are related to the subsequent achievement outcomes, and vice versa, according to the assessment framework (see section 5.2.3). To sufficiently account for stability within AME appraisals and performance throughout the semester, first- and second-order autoregressive paths will be controlled for.³ This means that any construct or achievement variable controls for all prior measurements of the same construct or variable. Cross-lagged effects account for the relations between other neighboring AME appraisals apart from the achievement outcomes. The modelling of autoregressive and cross-lagged paths is illustrated by way of example in Figure 8.1 and was done analogously in all subsequent models.

Figure 8.1 Modelling of Autoregressive and Cross-Lagged Paths Using the Example of the Expectancy Model
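The path structure can also be written down as a list of regressions. The sketch below collects the self-efficacy and quiz paths of the expectancy model (labels as in Table 8.7) and prints one regression line per outcome. The `outcome ~ predictors` notation follows the lavaan convention and is used here only as a compact notation; it is an assumption and not a statement about the software or syntax used in the study.

```python
from collections import defaultdict

# Paths written as (predictor, outcome); labels follow Table 8.7.
reciprocal = [
    ("S1", "Q1"), ("S1", "Q2"), ("Q1", "S5"), ("S5", "Q3"),
    ("S5", "Q4"), ("Q4", "S9"), ("S9", "E"),
]
autoregressive = [
    ("S1", "S5"), ("S5", "S9"), ("S1", "S9"),   # first and second order
    ("Q1", "Q2"), ("Q2", "Q3"), ("Q3", "Q4"),
    ("Q1", "Q3"), ("Q2", "Q4"), ("Q1", "Q4"),
    ("Q4", "E"),                                # exam regressed on quiz 4 only
]


def regression_lines(paths):
    """Group predictors by outcome and format one regression per outcome."""
    by_outcome = defaultdict(list)
    for predictor, outcome in paths:
        by_outcome[outcome].append(predictor)
    return [f"{outcome} ~ {' + '.join(preds)}" for outcome, preds in by_outcome.items()]


for line in regression_lines(autoregressive + reciprocal):
    print(line)
```

Listing the autoregressive paths first makes the control logic visible: every reciprocal predictor enters a regression that already contains the earlier measurements of the outcome itself.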

Reciprocal and cross-lagged effects from the measurement before last (i.e., quiz score 3 on self-efficacy at the end of the term) were not modelled because they were mostly insignificant and not a better predictor than the last measurement. Instead, only the closest relationships were modelled (i.e., quiz score 4 on self-efficacy at the end of the term). For the investigation of path coefficients and inter-construct relations according to the research question, the minimum requirement is metric invariance, so that the scale intervals of the factors are comparable across time (Chen, 2007; Murray et al., 2017; Steenkamp & Baumgartner, 1998, p. 82). Therefore, unlike the measurement models in section 8.2, the structural models only need to have equal factor loadings across time. For a clear identification of the models, each of the separate models was assigned an identifier according to Table 8.5.

3 The models with all higher-order autoregressive paths across the semester were compared to models that only included autoregressive paths between the same neighboring constructs. Modelling all higher-order autoregressive paths did not lead to the rejection of any significant relations between attitudes, emotions, and achievement compared to the model with fewer autoregressive paths.

Table 8.5 Identifiers for the Separate Structural Models

Identifier  Model
ME          Structural expectancy feedback model
MV          Structural value feedback model
MCE         Structural course emotion feedback model
MLE         Structural learning emotion feedback model

Table 8.6 presents the goodness-of-fit of the extended structural models under configural and weak factorial invariance conditions. Compared to the optimized measurement models and despite the further restrictions, the fit of the structurally extended models did not deteriorate considerably. Upon comparing the configural with the weak invariance variant of each of the four models, the Satorra-Bentler difference statistics indicate a significant deterioration of the more restrictive models. The other model fit criteria, however, do not change beyond the recommended values for the additional constraints on the factor loadings, the greatest changes being ΔCFI = −.006, ΔRMSEA = .001, and ΔSRMR = .003. Hence, it can be assumed that the factor loadings of the structural models are invariant across time. All fit criteria of the metrically invariant models except one lie within the recommended goodness-of-fit boundaries. The only notable deterioration concerns the SRMR for MLE, which increased strongly and is greater than the recommended value of .08. The higher value suggests that the model-implied correlation matrices do not sufficiently account for the observed correlations. This discrepancy could stem from the fact that only two measurement occasions for learning emotions do not sufficiently account for the structural relationships with the quiz score. This issue might be solved when the separate models are integrated into more comprehensive models later. As the other fit criteria of this model are adequate, no further modifications were made.

Table 8.6 Structural Model Fit under Configural and Weak Invariance Conditions

Model  Factors     Invariance  χ2         ΔSB-χ2   df   SCF [%]  RMSEA  90% C.I.   SRMR  CFI   TLI
ME     S, D        conf.       529.43c             188  12.3     .034   .031–.038  .053  .955  .940
                   weak        567.93c    28.62c   200  13.4     .035   .031–.038  .056  .952  .939
MV     V, I, A, E  conf.       2,037.56            779  10.5     .032   .031–.034  .053  .942  .929
                   weak        2,185.09c  176.25c  803  10.1     .033   .032–.035  .056  .936  .925
MCE    Jc, Hc      conf.       2,100.03            741  13.47    .034   .033–.036  .056  .944  .938
                   weak        2,152.09c  68.64c   762  13.06    .034   .034–.036  .057  .943  .939
MLE    JL, HL      conf.       600.84              171  15.76    .043   .039–.046  .096  .968  .960
                   weak        618.49c    22.70c   177  15.3     .042   .039–.046  .098  .967  .960

Notes. df = degrees of freedom, SCF = scaling correction factor

The following separate model analyses will each be introduced by a table displaying the path coefficients, along with their significance, of all modelled reciprocal, autoregressive, and cross-lagged effects as a first overview of the considered relations. Next, graphics of the structural models reduced to the significant relations between motivation, emotions, and achievement outcomes will be depicted to gather the most decisive information for the later holistic models.
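The reported maximum changes in fit between the configural and the weak invariance models can be re-derived from Table 8.6. The following sketch is a plain arithmetic check; the fit values are entered as read from the table, and the data structure and function name are illustrative assumptions.

```python
# (CFI, RMSEA, SRMR) per model and invariance level, as read from Table 8.6.
fit = {
    "ME":  {"conf": (0.955, 0.034, 0.053), "weak": (0.952, 0.035, 0.056)},
    "MV":  {"conf": (0.942, 0.032, 0.053), "weak": (0.936, 0.033, 0.056)},
    "MCE": {"conf": (0.944, 0.034, 0.056), "weak": (0.943, 0.034, 0.057)},
    "MLE": {"conf": (0.968, 0.043, 0.096), "weak": (0.967, 0.042, 0.098)},
}


def deltas(model):
    """Change in (CFI, RMSEA, SRMR) from the configural to the weak model."""
    conf, weak = fit[model]["conf"], fit[model]["weak"]
    return tuple(round(w - c, 3) for c, w in zip(conf, weak))


all_deltas = {model: deltas(model) for model in fit}
largest_cfi_drop = min(d[0] for d in all_deltas.values())
largest_rmsea_rise = max(d[1] for d in all_deltas.values())
largest_srmr_rise = max(d[2] for d in all_deltas.values())
print(all_deltas)
print(largest_cfi_drop, largest_rmsea_rise, largest_srmr_rise)
```

The largest changes obtained this way agree with the values named in the text, with the value model MV showing the largest CFI decrease.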

8.3.2

Expectancy-Feedback Model

Beginning with the structural relations between the expectancy constructs and the achievement outcomes, Table 8.7 provides an overview of all modelled causal relations. The summative exam was regressed only on the fourth quiz because that quiz has more items than the first three and is more representative in terms of content (Azzi et al., 2015). For reasons of clarity, the graphical structural models only include paths that are significant up to the .10 level, even though all paths were modelled according to Table 8.7 (see Figure 8.2). The structural model reveals significant reciprocal relationships between the quiz score and self-efficacy even if the autoregressive main effects and cross-lagged effects with previously assessed motivational constructs are controlled. More concretely, self-efficacy significantly relates to the subsequent quiz score, and the quiz score in turn relates to the subsequent self-efficacy assessment. For example, with a one-unit increase on the 7-point Likert scale of self-efficacy at the beginning of the semester (S1), students perform better by 2.6 percentage points in the subsequent quiz (Q1). Subsequently, students who achieved 100% in the first quiz reported a self-efficacy that was 1.050 units higher in the middle of the semester (S5). The reciprocal effect fades in the second half of the semester, so that self-efficacy at the midterm only has a less significant effect on Q3, but not on Q4, and Q3 does not impact self-efficacy at the end of the semester (S9). This fading effect could also be due to the slightly wider timespan between S5 → Q4 and Q3 → S9. Accordingly, the fourth quiz, representing the mock exam, again had a significantly positive effect on S9, which in turn positively impacts the final exam score. With each additional unit of self-efficacy, students perform 1.2 percentage points better on the exam. Hence, the positive effect of self-efficacy on achievement carries through to the end of the semester. It could thus be assumed that the quiz feedback compensates for the average decrease of self-efficacy throughout the semester.

By contrast, there is no significant interdependency between difficulty and quiz scores.⁴ However, difficulty relates to the final exam score. Students who perceive statistics to be more difficult by one unit perform worse in the exam by 2.7 percentage points. Regarding the cross-lagged effects, students' stronger belief in their own capabilities attenuates the perceived task difficulty. Difficulty, however, does not influence subsequent self-efficacy. The overall lack of influence and influenceability conforms to the mostly constant latent means of the construct (section 8.2). For the later integration of the separate models, omitting this construct could be considered, taking the explained variance into account. A closer look at the autoregressive processes reveals that these constructs significantly relate to each of their respective previous measurements, whereby the directly preceding autoregressive effect was always a better predictor for the current occasion than those further back in time.⁵

Table 8.7 Structural Relationships between Quiz and Expectancy Factors

Model ME

Reciprocal paths
Self-Efficacy  S1 → Q1   S1 → Q2   Q1 → S5   S5 → Q3   S5 → Q4   Q4 → S9   S9 → E
               .026c     .014b     1.050c    .019b     .001      .695c     .012b
p              .001      .039      .000      .035      >.100     .000      .034
Difficulty     D1 → Q1   D1 → Q2   Q1 → D5   D5 → Q3   D5 → Q4   Q4 → D9   D9 → E
               .001      −.009     −.046     .009      −.006     .113      −.027c
p              >.100     >.100     >.100     >.100     >.100     >.100     .000

Autoregressive paths
Expectancy,    S1 → S5   S5 → S9   S1 → S9   D1 → D5   D5 → D9   D1 → D9
Difficulty     .505c     .610c     .247c     .493c     .485c     .290c
Quiz           Q1 → Q2   Q2 → Q3   Q3 → Q4   Q1 → Q4   Q2 → Q4   Q1 → Q3   Q4 → E
               .612c     .782c     .464c     .155c     .267c     .290c     .345c

Cross-lagged paths and latent residual correlations
Labels as printed: S1 → D2, D1 → S2, S2 → D3; D1 ↔ S1, D2 ↔ S2, D3 ↔ S3.
Only D3 ↔ S3 = −.025 can be matched to its label. The remaining values are printed as −.130b, −.063, −.098b, −.062, −.138c, and −.138c; their assignment to the labels could not be recovered from the source layout.

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

[Figure 8.2: path diagram not reproduced. Nodes: Q1–Q4, S1, S5, S9, D1, D5, D9, E, MF d1/d3. Printed path coefficients: .782, .464, .345.]

Figure 8.2 Path Diagram for the Relationships between Quiz and Expectancy Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3; MF = method factor)

4 As a preliminary test of a potential inverted u-shaped relationship, as mentioned in section 3.2.6, a scatterplot was consulted. A very vague u-shaped relationship became apparent, which, however, did not account for considerably more variance in the quiz performance compared to a linear relationship.

5 As this pattern of the autoregressive effects is similar for the subsequent models, it will not be mentioned again in the subsequent chapters.
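The interpretation of the unstandardized coefficients can be made explicit with a short calculation. The sketch below assumes that quiz and exam scores enter the model as proportions between 0 and 1, which is what the reading "100% in the first quiz corresponds to 1.050 units" implies; the constant and function names are illustrative.

```python
# Unstandardized coefficients reported in the text and in Table 8.7.
B_S1_Q1 = 0.026    # self-efficacy (t1) -> quiz 1
B_Q1_S5 = 1.050    # quiz 1 -> self-efficacy (t5)
B_S9_EXAM = 0.012  # self-efficacy (t9) -> exam
B_D9_EXAM = -0.027 # difficulty (t9) -> exam


def score_change_pp(coefficient, attitude_change=1.0):
    """Expected change in a score (percentage points) per change in the attitude."""
    return 100 * coefficient * attitude_change


def attitude_change(coefficient, score_share):
    """Expected change in an attitude (scale units) for a score given as a share of 1."""
    return coefficient * score_share


print(round(score_change_pp(B_S1_Q1), 1))       # one unit more self-efficacy, quiz 1
print(round(attitude_change(B_Q1_S5, 1.0), 3))  # full score in quiz 1, self-efficacy at t5
print(round(score_change_pp(B_S9_EXAM), 1))     # one unit more self-efficacy, exam
print(round(score_change_pp(B_D9_EXAM), 1))     # one unit more difficulty, exam
```

Under this assumption, a student with half the maximum score in quiz 1 would be expected to differ by about half of 1.050 units from a student with zero points, all else being equal.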

8.3.3

Value-Feedback Model

Analogous to the procedure of the previous subchapter, the structural relations between achievement and the value constructs (including effort) will be examined, beginning with the modelled reciprocal, cross-lagged, and structural paths in tabular form (Table 8.8).

Table 8.8 Structural Relationships between Quiz and Value Factors

Model MV

Reciprocal paths
Value          V1 → Q1   V1 → Q2   Q1 → V5   V5 → Q3   V5 → Q4   Q4 → V9   V9 → E
               .006      .011a     .713c     .001      −.004     .369a     .006
p              >.100     .053      .001      >.100     >.100     .083      >.100
Interest       I1 → Q1   I1 → Q2   Q1 → I5   I5 → Q3   I5 → Q4   Q4 → I9   I9 → E
               .020b     .001      .868c     −.004     .003      .458b     −.009
p              .037      >.100     .000      >.100     >.100     .024      >.100
Affect         A1 → Q1   A1 → Q2   Q1 → A5   A5 → Q3   A5 → Q4   Q4 → A9   A9 → E
               .008a     .005      1.240c    .010a     .011b     .085      .012b
p              .087      >.100     .000      .061      .014      >.100     .012
Effort         E1 → Q1   E1 → Q2   Q1 → E5   E5 → Q3   E5 → Q4   Q4 → E9   E9 → E
               .009      .007      1.048c    .047c     .008      .835c     −.002
p              >.100     >.100     .000      .000      >.100     .000      >.100

Autoregressive paths
Value,         V1 → V5   V5 → V9   V1 → V9   I1 → I5   I5 → I9   I1 → I9
Interest       .556c     .555c     .226c     .494c     .580c     .216c
Affect,        A1 → A5   A5 → A9   A1 → A9   E1 → E5   E5 → E9   E1 → E9
Effort         .520c     .536c     .158b     .523c     .826c     .025
Quiz           Q1 → Q2   Q2 → Q3   Q3 → Q4   Q1 → Q4   Q2 → Q4   Q1 → Q3   Q4 → E
               .604c     .745c     .457c     .144c     .254c     .269c     .358c

Cross-lagged paths
t1 → t5        V1 → I5   V1 → A5   V1 → E5   I1 → V5   I1 → A5   I1 → E5
               .114b     .033      −.036     .132b     .121a     −.065
               A1 → I5   A1 → V5   A1 → E5   E1 → V5   E1 → I5   E1 → A2
               −.007     −.051     −.089c    .012      .094b     −.068
t5 → t9        V5 → I9   V5 → A9   V5 → E9   I5 → V9   I5 → A9   I5 → E9
               .035      −.076     .053      .070      .242c     −.061
               A5 → I9   A5 → V9   A5 → E9   E5 → V9   E5 → I9   E5 → A9
               .028      −.053     −.071b    −.089     −.058     −.067

Correlated latent residuals
t1             V1 ↔ I1   V1 ↔ A1   V1 ↔ E1   I1 ↔ A1   I1 ↔ E1   A1 ↔ E1
               .965c     .371c     .310c     .409c     .411c     …
t5             V5 ↔ I5   V5 ↔ A5   V5 ↔ E5   I5 ↔ A5   I5 ↔ E5   A5 ↔ E5
               .712c     .322c     .207c     .320c     .397c     …
t9             V9 ↔ I9   V9 ↔ A9   V9 ↔ E9   I9 ↔ A9   I9 ↔ E9   A9 ↔ E9
               .647c     .262c     .218c     .276c     .198c     .004

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. The two cells marked … hold the values −.356c and .022; which of the two belongs to A1 ↔ E1 and which to A5 ↔ E5 could not be recovered from the source layout.

For a better illustration of the structural relations, Figure 8.3 is once again reduced to the significant reciprocal relations between attitudes and quiz scores. Concerning the reciprocal effects between quiz scores and value appraisals at the beginning of the semester, affect, value, and interest positively relate to either the first or the second quiz score. For instance, the higher students’ positive affective appraisal (i.e., feeling less anxious), the better the quiz score on average. The initial effect on the first quiz scores is less significant than that of self-efficacy. Only the effect of affect on quiz score 1 is significant at p < .05 while the other two effects are significant at the .10-level. Quiz score 1 has a highly significant impact on all four subsequent value components. Students who achieved the full score in the quiz felt on average 1.24 units less anxiety and stress (affect) at the midterm. The quiz score also fosters subsequent interest and value appraisals: the higher the quiz score, the better the interest and value appraisal in the middle of the semester. The performance in the quiz also leads to a higher intention to study hard for the statistics course (effort). More concretely, students who already performed well in the quiz plan to invest more effort into learning statistics. Hence, quiz performance seems to have a spurring rather than a regulatory or attenuating effect on learning statistics for well-performing students.

[Figure 8.3: path diagram not reproduced. Nodes: Q1–Q4, E, A1, A5, A9, I1, I5, I9, V1, V5, V9, E1, E5, E9, MFa5, MFv1, MFe1. Printed path coefficients: .604, .745, .457.]

Figure 8.3 Path Diagram for the Relationships between Quiz and Value Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3; MF = method factor)

Concerning the second half of the semester, affect and effort significantly influence quiz performance. Students who feel less stressed or anxious by one unit achieve on average 9 to 11 more percentage points in the subsequent quiz. Students who plan to invest more effort into their statistics course also achieve a better quiz score. Thus, there is a continuous reciprocal relation between quiz scores, on the one hand, and appraised effort and affect, on the other hand, until the end of the semester. There is, however, no significant relation between quiz 4 and final affect or between effort and the subsequent exam score. Of the four components, only affect has a significant impact on the exam in such a way that less anxious students perform better in the exam. The reciprocal relations between value, interest, and the quiz score only persist until the second quiz. Only performance in the fourth quiz influences subsequent statistics-related interest. Similar to the expectancy constructs, the magnitude of the quiz effects on value appraisals fades throughout the semester; the effects are most present and pronounced at the beginning of the semester.

The cross-lagged effects are in line with expectations in such a way that positive appraisals foster each other. For instance, value and interest positively relate to each other at the subsequent measurement occasions. There also is a diminishing effect in such a way that less anxious and stressed students tend to invest less effort at the subsequent measurement occasion. The significance of all cross-lagged effects also decreases in the second half of the semester. Concerning the average development throughout the semester (see section 8.2), the positive reciprocal quiz effects again suggest that they might at least compensate for the mean decrease of all value components. Moreover, the latent residual correlations at the same measurement occasion are in some constellations higher compared to the other structural models. While most correlations are moderate and not greater than .40, correlations between interest and value are greater than .60 and suggest that there are left-out predictors or covariates, so that the model does not sufficiently account for the covariance between these factors. This indicates the necessity of analyzing the structural models also in different groups according to gender, prior knowledge, and design (see chapter 9).

8.3.4

Emotion-Feedback Models

For the modeling of the paths between course emotions and achievement, some specifics have to be considered first. In the 2018 cohort, the in-course assessment of Jc2, Hc2 and Jc4, Hc4 fell into the one-week processing period of the first and second quiz, respectively. To guarantee the temporal causality and consistent reciprocal modeling of the paths Jc2, Hc2 → Quiz 1 and Jc4, Hc4 → Quiz 2, all participants who had already completed the quizzes before the course assessment were excluded from the sample.⁶ Moreover, the effects of Jc6 and Hc6 on quiz score 3 have not been modelled because their chronology differed between the two cohorts.⁷ Finally, only the effects between the immediately neighboring emotional states and quizzes were modelled because relations between constructs that were set wider apart were insignificant in all cases. Apart from these exceptions, all reciprocal, cross-lagged, and autoregressive relations were modelled according to Table 8.9.

6 For both paths, only about 40 participants had to be excluded. Exclusion was possible on grounds of the administrative quiz data, which indicated the date and time at which students submitted their quiz. Selection effects could be ruled out since descriptive statistics (mean, standard deviation, etc.) of the complete and the reduced quiz scores remained nearly identical.

7 In the 2017 cohort, emotions at t6 were assessed before quiz 3 and vice versa for the 2018 cohort.

Table 8.9 Structural Relationships between Quiz and Course Emotions

Model MCE

Reciprocal paths
Course         Jc2 → Q1  Q1 → Jc4  Jc4 → Q2  Q2 → Jc6  Jc6 → Q4  Q4 → Jc9  Jc9 → E
Enjoyment      …         .472b     .018c     .020      −.001     .434c     −.010b
p              >.100     .021      .004      >.100     >.100     .005      .010
Course         Hc2 → Q1  Q1 → Hc4  Hc4 → Q2  Q2 → Hc6  Hc6 → Q4  Q4 → Hc9  Hc9 → E
Hopeless       −.037c    −.706c    −.014b    −.477b    −.007     −.606c    −.021c
p              …         .000      …         .031      >.100     .000      .000

Autoregressive paths
Course         Jc2 → Jc4  Jc4 → Jc6  Jc6 → Jc9  Jc2 → Jc6  Jc4 → Jc9  Jc2 → Jc9
enjoyment,     .737c      .644c      .678c      .209c      .315c      −.095
hopeless       Hc2 → Hc4  Hc4 → Hc6  Hc6 → Hc9  Hc2 → Hc6  Hc4 → Hc9  Hc2 → Hc9
               .487c      .491c      .388c      .208b      .240c      .089
Quiz           Q1 → Q2   Q2 → Q3   Q3 → Q4   Q1 → Q4   Q2 → Q4   Q1 → Q3   Q4 → E
               .586c     .821c     .461c     .162c     .258c     .288c     .347c

Cross-lagged paths
Course         Jc2 → Hc4  Hc2 → Jc4  Jc4 → Hc6  Hc4 → Jc6  Jc6 → Hc9  Hc6 → Jc9
enjoyment,     .012       −.049      …          .051       …          …
hopeless

Correlated latent residuals
               Jc2 ↔ Hc2  Jc4 ↔ Hc4  Jc6 ↔ Hc6  Jc9 ↔ Hc9
               −.451c     −.303c     −.223c     −.080

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. Cells marked … could not be recovered from the source layout. The unassigned values printed in the reciprocal block are .001 and .000; those printed in the cross-lagged block are −.038, .006, −.040, and −.083b.

Figure 8.4 illustrates all relations that were significant at the .10-level.

[Figure 8.4: path diagram not reproduced. Nodes: Q1–Q4, JC2, JC4, JC6, JC9, HC2, HC4, HC6, HC9, EX. Printed path coefficients: .821, .347.]

Figure 8.4 Path Diagram for the Relationships between Quiz and Course Emotion Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3)

The path diagram depicts predominantly reciprocal relationships between course emotions and achievement. For course hopelessness, the pattern is most consistent. The greater the course hopelessness at the beginning of the semester (t2), the worse students perform in the subsequent first quiz. An increase in hopelessness by one unit precipitates a decrease in quiz achievement by 3.7 percentage points, which conforms to its negative deactivating mode of action. This also applies when the direction of effect goes from quiz to hopelessness. Hopelessness of students who achieved the full score in the quiz decreased by .706 units at t4. In turn, students with a lower level of hopelessness once again performed better at quiz 2 and were also less frustrated at t6 and t9. Moreover, lower hopelessness in the last week of the semester related to a better exam score. The reciprocal pattern for course enjoyment is less consistent and shorter compared to hopelessness. More concretely, initial enjoyment has no significant impact on the first quiz score, and the second quiz does not significantly relate to subsequent course enjoyment. The reciprocity entails the positive effect of the first quiz on Jc4, which also translates into better performance in quiz 3, suggesting the positive activating impetus of enjoyment. The positive effect resumes with the effect of the fourth quiz on enjoyment at the end of the semester. Afterwards, there is a negative effect of course enjoyment on the exam score, which is not in line with expectations. The more students enjoyed the statistics course, the worse they performed in the final exam. A first assumption for the discrepancy could be the different methods of examination, as the positive relationship with the quiz scores conforms to theory and empirical findings. The negative effect has to be differentiated in the subsequent multiple group analyses for a more qualified evaluation. For both emotions, significant effects are interrupted between t6 and t9, i.e., Jc6 and Hc6 do not impact subsequent quiz performance. This conforms to the rather stagnating average development of course emotions and the decrease in effect size by the end of the semester (Cohen’s d; see section 8.2). However, by the end of the semester, quiz effects become highly significant again, suggesting a stronger influenceability of appraisals shortly before the exam.

Next, it is of interest whether the course-related mechanisms also apply to the learning-related context. The peculiarities in path modeling for course emotions (i.e., due to different timing in both cohorts) were not relevant for the learning-related counterpart, so that all possible relationships could be modelled straightforwardly. Since the assessment of learning-related emotions was not as close-meshed as for course emotions (only two measurement occasions), the assessment at t3 (JL3, HL3) is related to both the second and third quiz. Table 8.10 and Figure 8.5 show the unstandardized regression coefficients in the known manner. Similar to course enjoyment, higher enjoyment outside the course positively predicts subsequent quiz scores. Quiz scores, however, do not significantly impact learning-related enjoyment. Hence, the reciprocal effect of course enjoyment was not found outside the course. The counterintuitive negative relation between enjoyment and exam was also found for the learning emotion, even though it is only significant at the .10-level. While learning hopelessness does not seem to have a consistent impact on quiz performance compared to course hopelessness (i.e., only HL3 impacts quiz 3 at the .10-level), the path coefficient of quiz 2 on subsequent learning hopelessness is considerably higher (twice to thrice in magnitude) than the effects found for course hopelessness.
Even though comparisons of unstandardized coefficient should be treated with caution, this finding suggests that quiz scores are more impactful for thwarting hopelessness while learning outside the course than in the course itself. Cross-lagged effects for both learning and course emotions reveal a negative relation between adjacent hopelessness and enjoyment. However, enjoyment relates more frequently in a benevolent manner to subsequent hopelessness than vice versa. This suggests that higher manifestations of the positive activating emotion have a wider influence on hopelessness. The autoregressive effects suggest that neighboring emotional appraisals involve stable components to a higher degree. Course hopelessness, by contrast, has lower autoregressive path coefficients compared to enjoyment, suggesting a higher volatility. Even though all higher-order autoregressive paths controlled for the stability of the emotional states, the reciprocal effects between emotions and achievement were still significant and carried through the entire course of the semester. It has to be noted however that the regression coefficients from quiz to subsequent course- and learning-related enjoyment were rather small compared to the EV constructs and

−.026 Q2 → Q3 .790c

−.172c

Q1 → Q2

.582c

.465c

Q3 → Q4

.503c

HL 3 → HL 7

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

Quiz

H L 3 → JL 7

JL 3 → H L 7

.000

>.100

.157c

Q1 → Q4

.671c

JL 3 → JL 7

>.100

−.003

HL 7 → Q4

.052

.263c

Q2 → Q4

.000

−.014c

HL 7 → E

.281c

Q1 → Q3

−.451c

JL 7 ↔ H L 7

.341c

Q4 → E


Learning emotions

.083

−.006

>.100

−1.395c

−.011a

>.100 Q2 → HL 7

.042 HL 3 → Q3

HL 3 → Q2

.000

−.303c

.004

JL 3 ↔ H L 3

−.008a

.012b

.017c .082

Lat. Resid JL 7 → E

JL 7 → Q4

JL 3 → Q3

Q2 → JL 7

Reciprocal effects

JL 3 → Q2

Cross-lagged & autoregressive effects

p

Learning hopeless

p

Learning Joy

Model MLE

Table 8.10 Structural Relationships between Quiz and Learning Emotions

258 Results of the Longitudinal Study

8.4 Empirical Modeling of the Integrative Model …

Figure 8.5 Path Diagram for the Relationships between Quiz and Learning Emotion Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3)

hopelessness, suggesting that it might be less relevant in explaining feedback processes throughout the semester. With the insight from separate models, the next subchapter aims to integrate the EV and emotion-related components into more holistic structural models.

8.4 Empirical Modeling of the Integrative Model of Achievement Motivation and Emotion

8.4.1 The Need for Downsizing the Expectancy-Value Causation Model

To extend the separately analyzed structural models above and to replicate the EV model of achievement motivation (Wigfield & Eccles, 2002) according to the FRAM model postulated in section 3.2.3, the EV constructs were coalesced into one structural model MFRAM. This model comprises six factors at three measurement occasions plus the achievement variables. The resulting structural model, however, has a non-positive definite latent variable covariance matrix: the construct S5 has a negative residual variance. Kolenikov & Bollen (2012) enumerate several reasons for such Heywood cases, among which small sampling fluctuations and structural misspecification are the most likely. In the absence of ad hoc means of detecting the reasons for negative residual variances, the most plausible assumption for the present case is model misspecification due to overparameterization in the more restrictive, holistic model.
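The notion of a non-positive definite covariance matrix can be illustrated with a minimal sketch. It assumes a small symmetric matrix held as nested Python lists; the function name `is_positive_definite` and both example matrices are hypothetical and not taken from the study. The check relies on the fact that a symmetric matrix is positive definite exactly when a Cholesky factorization with strictly positive pivots exists.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix has a Cholesky factorization
    with strictly positive pivots, i.e., if it is positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    # A non-positive pivot means the matrix is not positive definite.
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


# Hypothetical, well-behaved covariance matrix of two latent variables.
proper = [[1.0, 0.5], [0.5, 1.0]]

# Hypothetical Heywood-like case: the second variance estimate is negative.
heywood_like = [[1.0, 0.2], [0.2, -0.05]]

proper_ok = is_positive_definite(proper)
heywood_ok = is_positive_definite(heywood_like)
```

In this sketch, a single negative variance estimate is enough to make the whole matrix fail the check, which mirrors why one negative residual variance renders a complete structural model untenable.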


Moreover, inhomogeneous indicators with cross-loadings that are forced to load onto one factor only might cause problems when their correlative structure in the data does not match the enforced factor assignment. Recalling the findings from section 7.2.3, the item s2 had considerable cross-loadings with the affect construct at all measurement occasions. It should be noted that in educational research, different views suggest that self-efficacy or self-concept perceptions should also include emotional reactions (Bong & Skaalvik, 2003, p. 7). These notions might resonate in the higher correlation between both constructs when combining all expectancy and value constructs in one model and might be a reason for the negative residual variance. Only the omission of the affect construct from the complete model prevents the non-positive definite covariance matrix. Even though affect was found to interrelate positively with quiz and exam scores (see section 8.3.3), the construct tends to be negligible considering that it did not directly spring from EV theory, but was composed in the context of the SATS-M. The new structural model with five factors has appropriate fit criteria (χ2 = 2,913.63, df = 1167, CFI = .928, TLI = .915, SRMR = .053, RMSEA = .031). The difficulty construct still stands out because it is neither influenced by other constructs nor influences them (see section 8.3.2), a pattern that persists in the holistic model with the additional value constructs. There is only one new significant relation, whereby a higher difficulty appraisal at t1 leads students to report investing more effort at t5 (p = .054). Even though this relationship and the ones reported in section 8.3.2 are in line with expectations, the construct itself seems to be negligible in the holistic framework due to the small number of other significant relationships.
The questionable relevance is aggravated by several methodological issues with the construct. Two method factors were needed to achieve good fit criteria and indications for adequate reliability and validity (see section 7.1.3). Moreover, difficulty may be a formative construct which would not fit well in with the other reflective constructs (see section 7.1.1). From a theoretical point of view, the difficulty construct conforms to perceptions of task demands under the umbrella of the longer-term, personal self-schemata while self-efficacy is subsumed under the shorter-term expectations of success (see section 3.2.2). This might be the reason why difficulty is less malleable and thus less meaningful in the context of this situated microanalytic study. The omission of the two constructs was also done in anticipation of the integrative structural analyses relating to the CV theory of achievement emotions. To integrate the motivational and emotional constructs, a reduced, but still meaningful baseline model representing the EV framework is necessary to


avoid an overcomplex model with eight constructs at seven measurement occasions. In that regard, affect and difficulty are most dispensable regarding their theory-related relevance and their statistical performance. Figure 8.6 shows the new structural model with four factors.

Figure 8.6 Path Diagram for the Relationships between Quiz and Expectancy-Value Constructs. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3; MF = method factor)

The model fit of the four-factor model is acceptable (χ2 = 2,238.44, df = 815, CFI = .931, TLI = .919, SRMR = .052, RMSEA = .034), and most reciprocal and cross-lagged relationships conform to those of the separate models. The main difference is that self-efficacy and value no longer significantly relate to subsequent quiz performance at t1 and t5. It seems that these effects are channeled through the impact of previous interest (t1) and effort (t5) on quiz performance. The insignificance of prior self-efficacy in the four-factor model comes into play again when investigating interaction effects between expectancy and value. Another noteworthy relation is that between self-efficacy and effort. While effort positively contributes to subsequent self-efficacy, the relationship is inverse in the direction from self-efficacy to effort. In other words, higher self-efficacy leads to a downward adaptation in subsequent effort at t9. In a next step, it has to be tested whether the present, reduced version of the model MFRAM serves as an appropriate basis to extend it to the FRAME model


as postulated in section 3.6. The first attempt to combine the four remaining EV constructs with the course- and learning-related emotions again resulted in a non-positive definite covariance matrix. Due to the acceptable fit of all separate models, misspecification might again be due to overparameterization, requiring an even more economical model. This methodological constraint can, however, also be construed as an opportunity for a more differentiated view on the synergetic functioning of the CV appraisals. More concretely, splitting up the models into a value-related and an interest-related control-value model allows for the investigation of interaction effects (as postulated in section 3.4) between self-efficacy and interest in one model, and self-efficacy and value in the second model. Along these lines, the models will be split up in such a way that the self-efficacy construct is combined with each of the two other remaining value constructs (interest and value) in separate models. Effort will be retained in all these models as a theoretically important facet of achievement motivation, which was found to consistently interrelate with quiz scores and also with self-efficacy. It is considered justifiable to separate these constructs because they open different perspectives (i.e., interest as intrinsic value and value as utility value) on the impact of value appraisals on achievement and their reciprocity with expectancy and achievement emotions. Moreover, the separate combination of two pairs of EV facets allows for modelling latent interaction terms to investigate the influence of a multiplicative conjunction of both appraisals (i.e., self-efficacy × interest, self-efficacy × value). Testing the inclusion of the moderator effects in these less complex models compensates for the added complexity of these computations (Murphy & Russell, 2017, p. 562).
Before attempting to merge the EV and achievement emotion constructs, the two new separate EV models (MEI and MEV) will be compared with the separate prior models (ME and MV) as a consistency check. Table 8.11 provides model identifiers for an explicit reference to the models in the following chapters.

Table 8.11 Identifiers for the Separate Structural Expectancy-Value Models

MFRAM   Expectancy-value feedback model (without affect and difficulty)
MEI     MFRAM without value
MEV     MFRAM without interest

The model fit of both models is appropriate (for detailed fit information and the graphical structural models see Appendix 8 in the electronic supplementary material) and the reciprocal relations between EV appraisals and achievement in


both models MEV and MEI mostly conform to those of the separate expectancy and value models. Regarding MEV, the effect of value at t1 and t5 on the subsequent quiz score remains significant at the .10-level. Self-efficacy at t1 again positively relates to subsequent quiz performance (p = .039), which conforms to the separate model, but not to the four-factor structural model. Only in model MEV does effort negatively relate to exam performance (p = .059). This finding is contrary to the positive relationships found for effort and quiz performance and could be due to the different conditions of both testing formats (i.e., time limitations for the exam etc.). Regarding MEI, self-efficacy at t1 does not impact the subsequent quiz score, contrary to the separate model and the model MEV, confirming that most of the influence on the subsequent quiz score is channeled through the interest component. Interest negatively relates to the exam score (p = .074), which is rather counterintuitive. Concerning the cross-lagged effects in both models, only two differences were found compared to the separate models: interest and value at t5 relate significantly and positively to effort at t9 in MEI and MEV, respectively. Moreover, in both models, higher self-efficacy diminishes the self-appraised effort invested in learning statistics. In sum, the effects originating from quiz performance on EV appraisals are stable across all models. Two counterintuitive, negative effects were found for the effect of interest and effort at t9 on the exam score, which however only prevail in the models MEI and MEV (see Appendix 8 in the electronic supplementary material). Other than that, most of the effects from the former separate models remain significant in the new ones, too. Moreover, self-efficacy only has a significant impact on quiz score 2 in the value model while this effect is insignificant in MEI.
Hence, splitting up the models seems to be tenable because some effects seem to cancel each other out in the four-factor model. Further potential reasons for the partially differing functioning of the effects between both models will be investigated later by means of latent interactions to see whether self-efficacy, value, and interest have different synergetic relations. Beforehand, the EV facets will be enhanced by the inclusion of achievement emotions to appropriately represent the CV theory of achievement emotions.

8.4.2 Integrating Achievement Motivation and Emotion into Additive Control-Value Causation Models

To obtain a more differentiated picture of the separate emotional contexts, learning- and course emotions are first of all integrated separately into the structural EV models. Course- and learning-related emotions⁸ will be integrated into MEV and MEI to check for main effects between AME appraisals according to the CV theory. As explained in section 8.4.1, value and interest will be modelled separately to avoid overparameterization. Table 8.12 once again provides an overview of the different models that were built in the attempt to merge EV and achievement emotion constructs.

Table 8.12 Identifiers for the Structural Control-Value Models

MFRAME  Additive control-value feedback model of achievement emotions (with all constructs except for affect and difficulty)
MCI     MFRAME without value (control-interest)
MCV     MFRAME without interest (control-utility value)

Table 8.13 shows the model fit for the original model including both interest and value, which had yielded a non-positive definite covariance matrix (MFRAME), and the interest- and value-related CV causation models (MCI and MCV). Besides the non-positive definite latent covariance matrix, the suboptimal model fit of MFRAME (CFI/TLI < .90) indicates that it does not sufficiently fit the data. This may stem from the high number of constructs and measurement occasions with fixed factor loadings across time. Model fit for MCI and MCV is adequate as the RMSEA is smaller than .05, CFI and TLI are greater than .90, and the SRMR is near the “good” threshold of .05. Even though including a high number of constructs and the broadest representation of the learning process throughout the semester with eight measurement occasions, both models yield a

⁸ Learning-related and course-related emotions each were also considered separately in conjunction with the control-value constructs. For the following analyses, however, setting up a complete model was deemed more informative to reduce the total number of separate models and to provide a more valid representation of the learning process throughout the semester. The simultaneous modeling also provides the opportunity to appraise the relevance of either emotional modality while controlling for the other one and their potential differential functioning, also regarding the subsequent multiple group analyses.

Table 8.13 Comparative Model Fit of Three Different Control-Value Models

Model            MFRAME*              MCI                        MCV
Name of factors  S, I, V, E, Jc, Hc   S, I, E, Jc, Hc, JL, HL    S, V, E, Jc, Hc, JL, HL
χ2               7,408.01             7,322.55                   7,535.38
df               2949                 3237                       3485
SCF [%]          10.37                11.78                      10.99
RMSEA            .031                 .029                       .027
90% C.I.         .030–.032            .028–.029                  .027–.028
SRMR             .066                 .065                       .065
CFI              .897                 .919                       .919
TLI              .890                 .913                       .914
AIC              263,961.94           278,844.86                 291,60.90
BIC              271,556.79           286,942.11                 299,96.04

Notes. * This model has a non-positive definite covariance matrix and is therefore not tenable; df = degrees of freedom, SCF = scaling correction factor

fairly acceptable overall fit. Following the general evaluation of the model quality, Table 8.14 shows the path coefficients for the CV model including the interest construct (MCI )9 . The path model in Figure 8.7 depicts all significant reciprocal relationships: The path model reveals that a majority of the reciprocal interrelations remain intact and consistent with those of the separate structural models. Compared to the separate EV and emotion models, most regression weights remain stable, with few exceptions mostly pertaining to the emotional interrelations. Apart from the impacts of initial self-efficacy (S1, S5, and S9) on subsequent quiz performance, the relationships of Q1 → Hc 4, Q1 → Jc 4, Q2 → Hc 6, JL 7 → E, HL 7 → E became insignificant. This suggests that EV appraisals dominate the reciprocal feedback process throughout the semester. Due to the insignificance of the relationship of Q2 → Hc 6, no course emotion construct of t6 seems to be relevant anymore while the quiz impact on learning-related hopelessness at t7 seems to be comparably more relevant. Shifting the focus to the CV theory of achievement emotions,

⁹ For reasons of brevity, coefficients for the autoregressive paths, cross-lagged paths within the expectancy-value and emotion constructs, and correlated residuals are included in Appendix 9 to Appendix 14 in the electronic supplementary material.

I1 → Q2

I1 → Q1

Jc 4 → Q2

Jc 2 → Q1

Q2 → JL 7 −.189

Q2 → Jc 6 −.243

HL 7 → Q4 −.004

Q2 → HL 7 −.671c

−.079

.003

JL 7 → Q4

−.010

Hc 6 → Q4

−.003

Jc 6 → Q4

.003

E5 → Q4

.004

I5 → Q4

−.010

S5 → Q4

Q2 → Hc 6

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

.011

HL 3 → Q2

.001

JL 3 → Q2

−.001

−.188

−.019c

Hc 4 → Q3

−.031c

Q1 → Hc 4

.008

Hc 4 → Q2

.007

Jc 4 → Q3

.042c

Q1 → Jc 4

E5 → Q3

.825c

−.011

Q1 → E5

.333a

I5 → Q3

.018

.399b Q1 → I5

S5 → Q3

Q1 → S5

Hc 2 → Q1

.012a

.006

.006

−.007

E1 → Q2

E1 → Q1

−.004

.000

.015

.020b

S1 → Q2

S1 → Q1

Reciprocal paths

−.381b

Q4 → Hc 9

.293a

Q4 → Jc 9

.696c

Q4 → E9

.342a

Q4 → I9

.608c

Q4 → S9

−.008

HL 9 → E

−.003

JL 9 → E

−.007

Hc 9 → E

−.011

Jc 9 → E

−.007

E9 → E

−.005

I9 → E

.021a

S9 → E


JL, HL

Jc , H c

E

I

S

Model MCI

Table 8.14 Reciprocal Relationships between Quiz, Expectancy-Interest, and Emotion Factors

Figure 8.7 Path Diagram for the Relationships between Quiz, Expectancy-Interest, and Emotion Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3)

the cross-lagged interrelations of the EV and course emotions are tabulated in Table 8.15. The first striking observation is that the cross-lagged interrelations between EV appraisals and emotions are more often and more consistently significant than those between constructs within the same domain (i.e., constructs within the EV domain and within the achievement emotion domain; see Appendix 10 and Appendix 13 in the electronic supplementary material). This serves as a first indication of CV-related relations resonating in the structural models. The most consistent reciprocal interrelations throughout the semester are found between self-efficacy and course-/learning-related hopelessness, followed by the relationship between self-efficacy and course-/learning-related enjoyment as well as interest and enjoyment. Starting with self-efficacy, enjoyment, and hopelessness, students with higher appraisals of self-efficacious control experience more subsequent enjoyment and less hopelessness. The significant interrelation of self-efficacy and course enjoyment fades during the last third of the semester while that of learning-related enjoyment and self-efficacy is consistent throughout the whole semester. Self-efficacy and course/learning hopelessness also consistently and reciprocally influence each other negatively up until the final week of the semester. These relations are

.047

E1 → JL 3

.428c

I1 → JL 3

.019

E1 → HL 3

−.129c

I1 → HL 3

−.467c

S1 → HL 3

.018 HL 3 → S5

.077

JL 3 → E5

.221c

JL 3 → I5

.109c

E5 → JL 7 −.040

−.036

.061

I5 → JL 7

.169a

S5 → JL 7

HL 3 → E5

−.067

HL 3 → I5

−.132c

−.013

JL 3 → S5

E5 → Hc 6

−.151c

−.044

I5 → Hc 6

−.207b

S5 → Hc 6

.012

E5 → Jc 6

−.029

I5 → Jc 6

.219b

S5 → Jc 6

Hc 4 → E5

−.155c

Hc 4 → I5

−.238c

Hc 4 → S5

.073

Jc 4 → E5

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

E ↔ JL /HL

I ↔ JL /HL

.192b

S1 → JL 3

−.019

E1 → Hc 4

−.062

−.112b

E1 → Hc 2

I1 → Hc 4

I1 → Hc 2

−.080

−.257c

.002 S1 → Hc 4

S1 → Hc 2

E1 → Jc 4

.122c

Jc 4 → I5 −.043

−.043

E1 → Jc 2

.399c

I1 → Jc 4

.115b

.103a

.075

I1 → Jc 2

Jc 4 → S5

S1 → Jc 4

S1 → Jc 2

−.073

E5 → HL 7

−.044

I5 → HL 7

−.431c

S5 → HL 7

.004

E5 → Hc 9

.090

I5 → Hc 9

−.188a

S5 → Hc 9

−.001

E5 → Jc 9

.079

I5 → Jc 9

.077

S5 → Jc 9

−.001

JL 7 → E9

.194c

JL 7 → I9

.115a

JL 7 → S9

−.138a

Hc 6 → E9

−.040

Hc 6 → I9

−.122a

Hc 6 → S9

.084

Jc 6 → E9

.151b

Jc 6 → I9

.009

Jc 6 → S9

−.104a

HL 7 → E9

−.076

HL 7 → I9

−.050

HL 7 → S9


S ↔ JL /HL

E ↔ HC

I ↔ HC

S ↔ HC

E ↔ JC

I ↔ JC

S ↔ JC

Cross-lagged paths between course emotions and EV constructs

Table 8.15 Cross-lagged Relationships between Expectancy-Interest, and Emotion Factors


in accordance with the CV theory as higher perceived control (i.e., self-efficacy) is assumed to decrease frustration and to increase enjoyment. Interest only has a significant impact on subsequent course/learning-related enjoyment and hopelessness at the beginning of the semester. Learning-related enjoyment also reciprocally relates to interest at the end of the semester. Conforming to the CV theory, higher interest is related to higher enjoyment and less hopelessness. In the further course of the semester, only the emotions significantly predict subsequent interest. While enjoyment further builds up subsequently perceived interest, students who feel hopeless are less interested at the following assessment. The hopelessness-interest relation only remains significant until the first half of the semester while the enjoyment-interest relation in both course and learning-related contexts lasts until the end of the semester. Prior effort is only positively relevant for course enjoyment at the start of the semester. Hence, students who report studying harder also seem to start appreciating statistics. The otherwise lacking relevance of the effort construct within the structural models may stem from the fact that the “cost” aspect is not explicitly involved in the CV theory. Prior effort never contributes significantly to diminishing subsequent hopelessness. By contrast, prior course- and learning-related hopelessness is more consistently negatively related to subsequent effort, underlining its mode of operation as a negative deactivating emotion. Students with higher frustration tend to reduce their engagement in learning statistics. Another pattern is that the effort-enjoyment relation seems more relevant at the beginning of the semester whereas the effort-hopelessness relation is more consistent from the midterm until the end of the semester.
This may be related to the fact that initial curiosity is more motivating whereas hopelessness becomes the dominant factor of performance motivation as the topics become more difficult and the exam approaches by the end of the semester. In a next step, replacing interest with value, the model MCV will be considered regarding potential differences with the separate models and MCI. Table 8.16 once again depicts the reciprocal paths. Figure 8.8 visualizes the significant paths for MCV. The reciprocal relations between enjoyment, hopelessness, self-efficacy, effort, and quiz remain mostly unchanged in terms of magnitude and significance compared to MCI. However, the value construct itself no longer significantly relates to quiz performance, in contrast to the separate value model (see section 8.3.3). It remains to be seen whether this insignificance also holds for different groups in the MGA. The cross-lagged effects of MCV are shown in Table 8.17. Compared to MCI, most cross-lagged effects remain consistent when interest is replaced with value. The value construct seems to operate similarly to the interest

E1 → Q2 .003 Jc 4 → Q2

E1 → Q1

.010

Jc 2 → Q1

Q2 → JL 7 −.176

Q2 → Jc 6 −.224

JL 7 → Q4 −.004

Q2 → HL 7 −.689c

−.132

.004

JL 7 → Q4

−.008

Hc 6 → Q4

−.004

Jc 6 → Q4

.004

E5 → Q4

.004

V5 → Q4

−.003

S5 → Q4

Q2 → Hc 6

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

.002

HL 3 → Q2

.011

HL 3 → Q2

−.001

−.206

−.019c

.004 Hc 4 → Q3

−.031c

.028

Hc 4 → Q2

Q1 → Hc 4

Jc 4 → Q3

.040c

Q1 → Jc 4

E5 → Q3

.808c

−.005

V5 → Q3

.017

S5 → Q3

Q1 → E5

.154

Q1 → V5

.348b

Q1 → S5

Hc 2 → Q1

.010

.007

−.006

V1 → Q2

.007

−.006

S1 → Q2

V1 → Q1

.026b

S1 → Q1

Reciprocal paths

−.368b

−.008

HL 9 → E

−.004

JL 9 → E

−.007

Hc 9 → E

−.012 Q4 → Hc 9

Jc 9 → E .304a

−.007

E9 → E

−.001

V9 → E

.020a

S9 → E

Q4 → Jc 9

.695c

Q4 → E9

.274

Q4 → V9

.596c

Q4 → S9


JL, HL

Jc , H c

E

V

S

Model MCV

Table 8.16 Reciprocal Relationships between Quiz, Expectancy-Value, and Emotion Factors

Figure 8.8 Path Diagram for the Relationships between Quiz, Expectancy-Value, and Emotion Factors. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3)

construct when considering that both constructs significantly impact course and learning enjoyment only at the beginning of the semester. In the further course of the semester, only prior course and learning enjoyment significantly influence statistics-related value, as was the case with interest. Hence, within the CV context, prior emotion seems to play a larger role in the motivational trajectories than prior value appraisals. The prior influence of the control facet (i.e., self-efficacy) seems to play a more consistent role for these trajectories, while the interest and value facets are only relevant at the beginning of the semester. Overall, the functioning of the value construct in MCV is comparable to that of the interest construct in MCI. In sum, the structural relations provided supporting evidence for relationships according to the CV theory of achievement emotions. While the interrelations between self-efficacy, hopelessness, and enjoyment were mostly consistent with the CV theory throughout the semester, the effect of interest and value on achievement emotions was significant only initially and faded in the further course of the semester. As regards the two different contexts of out-of-class and in-class emotions, no systematically differential pattern was found for hopelessness. As suggested by the state of research (see section 3.3.4), course enjoyment was more relevant for feedback processing in the complete and the separate models while learning enjoyment did not reciprocally relate with feedback. The question remains whether there are also multiplicative effects between the control (self-efficacy) and value (interest, value) constructs that foster subsequent achievement emotions and whether they

.130c

E1 → JL 3

.116c

V1 → JL 3

−.002

E1 → HL 3

−.054

V1 → HL 3

−.535c

S1 → HL 3

.005

E1 → Hc 4 HL 3 → S5

.066

JL 3 → E5

.185c

JL 3 → V5

.112c

−.039

HL 3 → E5

−.008

HL 3 → V5

.130c

−.033

E5 → JL 7

.016

V5 → JL 7

.182b

S5 → JL 7

−.017

JL 3 → S5

E5 → Hc 6

−.150c

.012

V5 → Hc 6

−.208b

S5 → Hc 6

.000

E5 → Jc 6

−.008

V5 → Jc 6

.203c

S5 → Jc 6

Hc 4 → E5

−.261c

Hc 4 → V5

−.244c

Hc 4 → S5

.063

Jc 4 → E5

.098a

Jc 4 → V5

.103b

Jc 4 → S5

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.

E ↔ JL /HL

V ↔ JL /HL

.422c

S1 → JL 3

−.039

E1 → Hc 2

V1 → Hc 4 −.015

−.050

−.106a

−.306c

V1 → Hc 2

S1 → Hc 4

−.009

E1 → Jc 4

−.019

S1 → Hc 2

.200c

E1 → Jc 2

.150c

V1 → Jc 4

.064

.264c

V1 → Jc 2

S1 → Jc 4

S1 → Jc 2

−.082a

E5 → HL 7

.007

V5 → HL 7

.468c

S5 → HL 7

.017

E5 → Hc 9

.016

V5 → Hc 9

−.164a

S5 → Hc 9

.017

E5 → Jc 9

.042

V5 → Jc 9

.100

S5 → Jc 9

.016

JL 7 → E9

.145c

JL 7 → V9

.103a

JL 7 → S9

−.133a

Hc 6 → E9

−.032

Hc 6 → V9

−.130b

Hc 6 → S9

.091

Jc 6 → E9

.105

Jc 6 → V9

−.010

Jc 6 → S9

−.092

HL 7 → E9

−.045

HL 7 → V9

−.047

HL 7 → S9


S ↔ JL /HL

E ↔ HC

V ↔ HC

S ↔ HC

E ↔ JC

V ↔ JC

S ↔ JC

Cross-lagged paths between course emotions and EV constructs

Table 8.17 Cross-lagged Relationships between Expectancy-Value and Emotion Factors


outlast the additive effects. Beforehand, the squared multiple correlation of both integrative models in this chapter will be shortly considered.

8.4.3 Squared Multiple Correlation of the Integrative Models

An indication for the evaluation of the predictive relevance of the structural models lies in the squared multiple correlation, or r-squared, which indicates the share of the variance of an endogenous construct that is accounted for by the other latent factors related to the construct (Weiber & Mühlhaus, 2014, p. 230). For the most comprehensive overview of the r-squares, the two additive models MCI and MCV are used and depicted in Table 8.18. Since the constructs difficulty and affect were omitted from these two models, their r-squares are reported from the separate expectancy and value structural models from sections 8.3.2 and 8.3.3: D5 = .197c, D9 = .433c, A5 = .342c, A9 = .494c. Except for difficulty at t5, one third to one half of the variability of these constructs is accounted for by the related exogenous constructs.

Table 8.18 Squared Multiple Correlation of the Endogenous Constructs from the Integrative Control-Value Models

Construct   Model MCI   Model MCV
S5          .508        .508
S9          .559        .566
V5          –           .428
V9          –           .562
I5          .494        –
I9          .554        –
E5          .330        .324
E9          .709        .710
J2          .204        .155
J4          .535        .527
J6          .605        .599
J9          .676        .671
Hc 2        .093        .092
Hc 4        .334        .330
Hc 6        .460        .456
Hc 9        .444        .448
JL 3        .221        .161
JL 7        .496        .489
HL 3        .157        .160
HL 7        .471        .466
Quiz 2      .373        .374
Quiz 3      .487        .486
Quiz 4      .558        .557
Exam        .373        .374

Note. All squared multiple correlations were significant at the .01 level.


The squared multiple correlations suggest that, by the end of the semester, mostly around half of the variance of the AME appraisals is accounted for by the preceding variables. With a share of approximately two-thirds, effort and course enjoyment at t9 have the highest share of explained variance. The magnitude of the r-squares also hardly varies when comparing the integrative models from this chapter with the separate models from section 8.3. This suggests that, apart from cross-lagged relationships from the varying constellations of constructs, autoregressive relationships contribute considerably to the share of explained variance. Regarding the quiz scores, one third to one half of their variance is accounted for by the preceding variables and constructs. Moreover, approximately 37% of the variability in the exam score is explained by means of the quiz scores and the AME appraisals. For a better classification of this magnitude, Cohen (1977, p. 78) notes that in the behavioral sciences, not much of the variance of endogenous variables is predictable. According to the rules of thumb for latent variables from Hair et al. (2011), the shares of explained variance of most investigated constructs are in the weak range from the beginning to the middle of the semester (R2 < .5) while they approach and occasionally enter the moderate range (.33 < R2 < .67) by the end of the semester. Hence, the squared multiple correlations suggest that the structural models can be considered to have at least a moderate predictive value regarding students’ motivational, emotional, and cognitive trajectories throughout the semester.
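The definition of the squared multiple correlation used in this subchapter can be made concrete with a short sketch. It assumes the common formulation in which the explained share equals one minus the ratio of residual to total variance of the endogenous construct; the function name and the input values are hypothetical illustrations and are not estimates from the study.

```python
def squared_multiple_correlation(residual_variance, total_variance):
    """Share of an endogenous construct's variance that is explained
    by its predictors: 1 - residual variance / total variance."""
    if total_variance <= 0:
        raise ValueError("total variance must be positive")
    return 1.0 - residual_variance / total_variance


# Hypothetical standardized construct (total variance 1.0) with an
# unexplained share of .627, which corresponds to an r-squared of about .373.
exam_like = squared_multiple_correlation(0.627, 1.0)

# Hypothetical Heywood-like case: a negative residual variance pushes the
# r-squared above 1, which is not a meaningful share of variance.
heywood_like = squared_multiple_correlation(-0.05, 1.0)
```

The second call shows why a negative residual variance, as encountered in section 8.4.1, cannot be interpreted substantively: the implied share of explained variance would exceed 100%.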

8.4.4

The Examination of Multiplicative Effects within the Integrative Control-Value Causation Models

In order to complement the additive perspective with the theoretically assumed multiplicative association between EV constructs, interaction terms between self-efficacy on the one hand and interest and value on the other hand, as well as their effects on performance and emotions, will be modelled into MCI and MCV. Moderation effects are present when the slope of a linear regression between two variables varies as a function of a third, exogenous variable (Murphy & Russell, 2017, p. 549), which brings nonlinear relations into the model. Such multiplicative relationships were also occasionally found in the context of the EV framework, even though they were mostly small (see section 3.4). In general, summative reviews of moderator studies, for instance in the organizational sciences, revealed that moderator effect sizes (f²) are usually low (Murphy & Russell, 2017, p. 551). The slightly different relationships found in the models MEI and MEV also suggest that different mechanisms might underlie the cognitive development throughout

8.4 Empirical Modeling of the Integrative Model …

275

the semester. As the separate models revealed that there is mostly no significant influence of prior expectancy, value, or achievement emotion on subsequent achievement after the second half of the semester, it could be assumed that only initial interaction effects are relevant, if any. For the testing of moderation effects, Mplus provides the approach of latent moderated SEM (Klein & Moosbrugger, 2000). The approach adjusts the distributional assumptions of the ML estimation to the non-normality produced by the latent interaction terms (Klein & Moosbrugger, 2000, p. 458; Marsh et al., 2004, p. 277). Based on multivariate distributions represented as finite mixtures of normal distributions, the cross-product effect of the moderator variables is estimated with the EM algorithm, unattenuated by measurement error (Kleinke et al., 2017, p. 70; Klein & Moosbrugger, 2000, p. 457). The “XWITH” command in conjunction with TYPE = COMPLEX and numerical integration is used to invoke latent moderated SEM for continuous factors (Little, 2013, p. 310). The interaction variable is included in addition to the predictors from which it is generated¹⁰. As there are three measurement occasions for the EV constructs (t1, t5, and t9), only one interaction term was modelled at a time for the exploration of interaction effects instead of three simultaneous terms, because each additional moderator variable across time exponentially increases computational demands as a function of the dimensions of integration, i.e., the number of latent interaction terms (Kleinke et al., 2017, p. 71; Little, 2013, p. 310). Each of the interaction coefficients at one measurement occasion (i.e., self-efficacy × interest and self-efficacy × value separately at t1, t5, and t9) was then checked for significant relations with subsequent achievement. The iterative procedure revealed that none of the interaction terms at t9 had a significant impact on exam performance. The interaction effects S9 × I9 and S9 × V9 will therefore be omitted from the subsequent analyses. Moreover, and most strikingly, the interaction terms never significantly impact course emotions, but only learning-related emotions¹¹. Hence, for the CV interaction effects, a sparser model was compiled that only included the EV constructs (i.e., self-efficacy, interest or value, and effort) and learning-related enjoyment and hopelessness at both measurement occasions. The omission of course emotions reduces the computational demand associated with the latent interaction, so

10 The interaction terms can only be modelled as predictors and not as endogenous variables, so that only their effect on the subsequent constructs will be investigated.

11 To rule out that interaction effects on course emotions disappeared due to model complexity, their significance was double-checked in sparser models that included the expectancy-value constructs and either course enjoyment or course hopelessness. In every case, the interaction effects on course emotions remained clearly insignificant (p > .20).


that both interaction terms of a model (e.g., S1 × I1 and S5 × I5) can be included simultaneously. First, the model fit of the interaction models will be evaluated. As Mplus does not provide the common fit indices for models using numerical integration (Kleinke et al., 2017, p. 71), alternative procedures have to be used, starting with the log-likelihood ratio test and a comparison of the AIC and BIC criteria¹². Therefore, baseline models without interaction effects will be compared to the models with interaction effects for self-efficacy × interest (MS × I) and self-efficacy × value (MS × V) in separate models (see Table 8.20). For reasons of comparability, the baseline models do not include auxiliary variables for missing data since these cannot be used in conjunction with latent moderated SEM. Table 8.19 presents the model identifiers which are used to refer to the multiplicative models.

Table 8.19 Identifiers for the Multiplicative Control-Value Models

| Identifier | Description |
|---|---|
| MC × I | Multiplicative version of MCI with interactive effect for S × I (without course emotions) |
| MC × V | Multiplicative version of MCV with interactive effect for S × V (without course emotions) |

The model fit of the baseline models without interaction terms (MEV and MEI) and of the interaction models (MS × V and MS × I) is presented in Table 8.20. Based on the H0 value, the log-likelihood ratio test (LR) can be conducted as a first indication to compare the fit of each model. The H0 value represents the log-likelihood value, whereby adjusted models should have a smaller absolute value than the baseline models. The significance of this decrease is tested by means of the LR test (Maslowsky et al., 2014). The test statistic D, indicating the degree of deviance, is calculated with the following formula:

$$D = -2\left(\text{loglikelihood}_{\text{model 0}} - \text{loglikelihood}_{\text{model 1}}\right)$$

12 Additionally, the r-squares of the multiplicative and additive models were compared (Murphy & Russell, 2017). The portion of explained variance of the respective dependent constructs attributable to the interaction terms ranges from .4 to .8%. As Murphy and Russell (2017, p. 552) note, the r-squares are of limited suitability to evaluate the added value of multiplicative models when the main effects already account for a substantial share of the variance. In that regard, the difference in r-square might underestimate the explanatory power added by the interaction term.


Table 8.20 Comparative Fit of Additive versus Multiplicative Models

| Model name | MEI | MS × I | MEV | MS × V |
|---|---|---|---|---|
| Interaction terms | none | S1 × I1, S5 × I5 | none | S1 × V1, S5 × V5 |
| Parameters | 230 | 244 | 247 | 261 |
| H0 Value | −70,485.81 | −70,459.54c | −76,801.51 | −76,782.61c |
| AIC | 141,431.62 | 141,407.07 | 154,097.01 | 154,087.22 |
| BIC | 142,658.67 | 142,708.81 | 155,414.75 | 155,479.65 |
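The likelihood ratio comparison can be retraced from the H0 values and parameter counts reported in Table 8.20. The following Python lines are a minimal sketch of that calculation and not the procedure used in Mplus. The dictionary keys, the function names, and the closed-form chi-square tail (which only holds for even degrees of freedom) are assumptions introduced for illustration.

```python
import math

# H0 (log-likelihood) values and free-parameter counts as reported in Table 8.20.
# The keys "MSxI" and "MSxV" are hypothetical labels for MS × I and MS × V.
MODELS = {
    "MEI": {"loglik": -70485.81, "params": 230},
    "MSxI": {"loglik": -70459.54, "params": 244},
    "MEV": {"loglik": -76801.51, "params": 247},
    "MSxV": {"loglik": -76782.61, "params": 261},
}


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lr_test(baseline, extended):
    """Return the deviance D, its degrees of freedom, and the p-value."""
    m0, m1 = MODELS[baseline], MODELS[extended]
    d = -2.0 * (m0["loglik"] - m1["loglik"])
    df = m1["params"] - m0["params"]
    return d, df, chi2_sf_even_df(d, df)


def aic(name):
    """AIC = 2k - 2 * log-likelihood, recomputed as a plausibility check."""
    m = MODELS[name]
    return 2 * m["params"] - 2 * m["loglik"]


for base, ext in [("MEI", "MSxI"), ("MEV", "MSxV")]:
    d, df, p = lr_test(base, ext)
    print(f"{base} vs {ext}: D = {d:.2f}, df = {df}, p = {p:.2g}")
```

Recomputing the AIC from the same inputs is a simple way to confirm that each parameter count is matched to the correct model column; small deviations in the second decimal can arise from rounding of the reported values.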

For instance, when comparing model 0 with main effects only (MEI) to model 1 with interaction terms (MS × I), the test statistic amounts to 52.54. The degrees of freedom relevant for determining the critical value are calculated by subtracting the number of free parameters of model 0 from that of model 1. The degrees of freedom for MS × I are 244 − 230 = 14. The test statistic is then compared to a χ² distribution table with df = 14, indicating that 52.54 is greater than the critical value of 29.14, so that the ratio test is significant with a p-value < .01. Hence, accounting for the interaction terms through extra parameters in the more complex model leads to a statistically significant change of the log-likelihood value. This suggests that the null model fits significantly worse than the alternative model with the interaction effects at t1 and t5. For the value interaction model, the test statistic yields 37.80 with df = 14 (261 − 247), so that the change in the likelihood value is highly significant as well. The AIC and BIC of the models provide a mixed picture. While the AIC values of the interaction models are slightly smaller than those of the additive models, their BIC values are higher, which suggests that the interaction models are not a better simplification of the data. Next, the unstandardized coefficients of the structural models will be consulted with regard to their significance, beginning with the structural model for MS × I (Figure 8.9)¹³. The significant reciprocal and cross-lagged effects from the additive model MEI also apply to the multiplicative model (see Appendix 8 in the electronic supplementary material). Apart from the re-emerging paths, learning-related hopelessness negatively relates to exam performance. Strikingly, there are significant interactions for S1 × I1 on quiz 2 and S5 × I5 on quiz 4, while self-efficacy and interest, considered separately, have no significant impact on subsequent quiz scores 2 to 4. In other words, the reciprocal association between self-efficacy and interest on the one hand, and formative achievement after quiz 1 on the other

13 All reciprocal, cross-lagged, and autoregressive path coefficients can be found in Appendix 15 to Appendix 24 in the electronic supplementary material.

Figure 8.9 Path Diagram for the Interactive Relations between Expectancy-Interest, Quiz, and Emotions. (Note. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3.)

hand persists through the significant multiplicative effect. The significance of the interaction despite the insignificance of the separate effects can be explained by drawing on the ceteris paribus assumption of the regression weights. Taken on its own, the regression weight of interest at t5 on quiz 4 (ß = .005 with p = .614), for instance, represents the magnitude of this effect for students with self-efficacy at its mean of zero, where the interaction term equals zero. The significant interaction term (ß = .011 with p = .026) suggests a synergistic relation in such a way that the effect of interest on quiz performance is moderated by the level of self-efficacy. Hence, whenever a student’s self-efficacy increases by one unit, the effect of interest on quiz performance becomes more positive, increasing by .011 each time. The positive algebraic sign indicates that the higher students’ self-efficacy, the more predictive interest is for formative achievement. Apart from achievement, the multiplicative terms S1 × I1 and S5 × I5 positively relate to subsequent learning-related enjoyment in addition to the significant cross-lagged main effects of self-efficacy and interest on learning-related enjoyment. Particularly the multiplicative effect on JL7 is highly significant (p < .001). The interaction effect on JL7 prolongs the reciprocal relationship under the moderating effect of self-efficacy. There is no association between the interaction terms and learning-related hopelessness. Figure 8.10 depicts the multiplicative associations for MS × V.
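The interpretation of the interaction coefficient for interest at t5 on quiz 4 can be illustrated with a short calculation. The sketch below uses the two unstandardized coefficients quoted in the text (.005 and .011); the function and constant names are hypothetical, and the moderator is expressed in units of the latent self-efficacy variable around its mean of zero.

```python
def conditional_effect(b_main, b_interaction, moderator):
    """Effect of the predictor at a given moderator level: b_main + b_interaction * moderator."""
    return b_main + b_interaction * moderator


# Coefficients as quoted in the text for interest at t5 on quiz 4.
B_INTEREST_T5 = 0.005  # main effect of interest (at mean self-efficacy)
B_SXI_T5 = 0.011       # latent interaction S5 x I5

for self_efficacy in (-1.0, 0.0, 1.0):
    effect = conditional_effect(B_INTEREST_T5, B_SXI_T5, self_efficacy)
    print(f"self-efficacy = {self_efficacy:+.1f}: effect of interest on quiz 4 = {effect:.3f}")
```

Each additional unit of self-efficacy shifts the conditional effect by the interaction coefficient, which is why the main effect alone only describes students at the mean of the moderator.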

Figure 8.10 Path Diagram for the Interactive Relations between Expectancy-Value, Quiz, and Emotions. (Note. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3.)

Comparing the reciprocal paths of the additive model MEV to MS × V, there is no longer a reciprocal relation between quiz performance and value. This discrepancy is not due to the modelled interaction effects, but due to the inclusion of the achievement emotions, and conforms to the effects of MCV. Concerning formative achievement, the interaction effect of S1 × V1 on the first quiz score was significant. For learning-related enjoyment at t3 and t7, the interaction effects were both significant. The interaction effect on JL7 prolongs the reciprocal relationship of value on learning-related enjoyment, which was insignificant in the additive model, under the moderating effect of self-efficacy. In sum, expectancy and value constructs have significant multiplicative relations with quiz performance and learning-related enjoyment in both interaction models. The multiplicative effect of value on formative achievement is, however, only relevant at the beginning of the semester, and both interest and value constructs only have a small loading on the subsequent quiz scores. A limitation of these analyses is that the parameter estimates represent the magnitude of the interaction at exactly average levels of the moderator variable. It may be, however, that the relationship between value and quiz, for instance, does not have a meaningful effect within that particular scope. Therefore, the concrete range and scope of the interaction effects along with their mode of operation was further explored by means of Johnson-Neyman


diagrams (Kleinke et al., 2017, p. 61). These diagrams allow for an investigation of the relationships across continuous levels of the moderator variable along with confidence bands, representing the regions for which the interaction effect becomes significant depending on the level of the moderator variables.

8.4.5

Visualization of Conditional Expectancy-Value Effects by Means of Simple Slopes

For a better understanding of the functioning of the significant interaction effects, simple slopes diagrams were generated for the significant multiplicative relations by means of model constraints (Kleinke et al., 2017, p. 65)¹⁴. Two different types of plots were generated to investigate the scope of the interaction effects along different levels of self-efficacy. To that end, the standard deviation of self-efficacy ensuing from a mean of zero served as a reference frame to compare the differential functioning of the interactions¹⁵. Starting with the multiplicative associations between quiz performance and self-efficacy with value, the Johnson-Neyman diagram was computed using the following formula:

$$\text{adj. effect of } V_1 \rightarrow Q_1 = b_1 \cdot V_1 + b_3 \cdot SV_1 \cdot S_1 \tag{8.1}$$

The formula represents the adjusted effect of V1 on Q1 as an additive function of the main effect of value at t1 on quiz 1 (b1) plus the degree of self-efficacy (S1) multiplied by the magnitude of the interaction effect of self-efficacy and value on quiz 1 (b3; Kleinke et al., 2017, p. 65). Following the recommendation of Muthén & Muthén (n.d.), the range on the x-axis spans −3 to 3 standard deviations of the latent variable, which ensures that the complete range of the data is considered¹⁶. Figure 8.11 shows the resulting Johnson-Neyman diagram along with the 95% confidence bands of the interaction effect SV1 → Q1. The plot reveals that the regression slope of the interaction effect, i.e., the adjusted effect of V1 on Q1, increases with self-efficacy across all of its values.

14 Johnson-Neyman diagrams were also checked for insignificant interaction effects to ensure that the insignificance applies to the complete range of the moderator variable. None of the insignificant interactions turned out to be meaningful as the confidence bands included the value of zero.

15 The latent means of the variables do not need to be centered because they are already fixed at zero by default.

16 The standard deviation of the latent moderator variable self-efficacy ranges between .9 and 1.12 according to the TECH4 output in Mplus.
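The logic behind a Johnson-Neyman diagram can be sketched in a few lines of Python. The example treats the conditional effect of value on quiz 1 as b1 + b3 × S1 and searches for the self-efficacy levels at which the 95% confidence band touches zero. All numbers in the sketch are illustrative placeholders of roughly the reported order of magnitude; in particular, the variances and the covariance of the estimates are hypothetical and are not taken from the Mplus output, and the function names are assumptions.

```python
import math


def conditional_effect(b1, b3, s):
    """Conditional effect of the predictor at moderator level s."""
    return b1 + b3 * s


def conditional_se(var_b1, var_b3, cov_b1_b3, s):
    """Standard error of b1 + b3 * s from the sampling (co)variances of b1 and b3."""
    return math.sqrt(var_b1 + 2.0 * s * cov_b1_b3 + s * s * var_b3)


def johnson_neyman_bounds(b1, b3, var_b1, var_b3, cov_b1_b3, z=1.96):
    """Moderator values at which |effect| / SE equals z (boundaries of the significance regions)."""
    a = b3 ** 2 - z ** 2 * var_b3
    b = 2.0 * (b1 * b3 - z ** 2 * cov_b1_b3)
    c = b1 ** 2 - z ** 2 * var_b1
    if abs(a) < 1e-15:  # degenerate case: the equation is linear in s
        return () if b == 0 else (-c / b,)
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return ()  # no boundary: the band never touches zero
    root = math.sqrt(disc)
    return tuple(sorted(((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))))


# Hypothetical placeholder values (assumptions, NOT model output).
B1, B3 = 0.013, 0.0145
VAR_B1, VAR_B3, COV_B1_B3 = 8.3e-5, 3.0e-5, 6.5e-6

for bound in johnson_neyman_bounds(B1, B3, VAR_B1, VAR_B3, COV_B1_B3):
    print(f"confidence band touches zero at self-efficacy = {bound:.2f}")
```

Boundaries that fall outside the plotted range of −3 to 3 standard deviations would indicate that the corresponding significance region lies beyond the empirically relevant levels of the moderator.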


Figure 8.11 Johnson-Neyman Diagram for the Effect of Value on Quiz 1 with Increasing Self-Efficacy

It also becomes evident that the regression slope of the interaction effect at an exactly average level of self-efficacy (SD = 0 due to the centering of all latent variables) yields a value of .012, corresponding to the above path diagram. The confidence bands further show that the more self-efficacious a student is, the more significant the relationship between value and quiz becomes. The regression slope becomes significantly different from zero for a self-efficacy level that is approximately .4 standard deviations above the average self-efficacy. This significant range is visualized by the green area in Figure 8.11, where the lower and upper confidence bands do not include the value of zero anymore. In other words, the interaction between self-efficacy and value only becomes significant for students with an above-average self-efficacy. For a formal confirmation of the visual evidence, new parameters along with their C.I. were computed for individuals with a self-efficacy of 1 SD below and above the mean, confirming the visual findings (see Table 8.21). The estimates across the three levels show that value fosters quiz performance more strongly for students with higher levels of self-efficacy (.027) compared to students with lower self-efficacy (−.002). The interaction effect was insignificant for students with a below-average and average self-efficacy while it was significant for students with an above-average self-efficacy. Moreover, only the C.I. for highly self-efficacious individuals does not include the zero value, as suggested in the Johnson-Neyman diagrams. In a next step, the differential functioning of the interaction across individuals with varying levels of self-efficacy can be compared


Table 8.21 Effect of Value at t1 on Quiz 1 at Varying Self-Efficacy Levels

| Level of Self-Efficacy | SV1 → Q1 Estimate | p | Lower 5% | Upper 5% |
|---|---|---|---|---|
| Low (−1 SD) | −.002 | .856 | −.018 | .015 |
| Average | .013 | .161 | −.002 | .028 |
| High (+1 SD) | .027c | .014 | .009 | .046 |

in one comprehensive diagram (see Figure 8.12). The depiction of the regression parameters was based on ordinary least squares estimation with the following exemplary formula for the effect of S1 × V1 on quiz 1:

$$\text{adj. effect of } V_1 \rightarrow Q_1 = b_1 \cdot V_1 + b_2 \cdot S_1 + b_3 \cdot V_1 \cdot S_1 \tag{8.2}$$

Selecting self-efficacy as the moderator of the influence of value on quiz performance, the formula is rearranged as:

$$\text{adj. effect of } V_1 \rightarrow Q_1 = (b_1 + b_3 \cdot S_1) \cdot V_1 + b_2 \cdot S_1 \tag{8.3}$$

The following plots depict the conditional effect of value on subsequent quiz scores at high (one SD above the mean), average, and low (one SD below the mean) levels of self-efficacy as the moderator. To obtain the effect of value on quiz performance for individuals with relatively lower levels of self-efficacy, for instance, the value of −1 is inserted into the equation, yielding:

$$\text{adj. effect of } V_1 \rightarrow Q_1 = (b_1 + b_3 \cdot (-1)) \cdot V_1 + b_2 \cdot (-1) \tag{8.4}$$
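The rearrangement from the full regression equation to the simple-slope form can be checked numerically. The following sketch computes, for a fixed level of self-efficacy, the intercept and slope of the line in value. The coefficient b2 is a hypothetical placeholder because the main effect of self-efficacy is not reported in this passage; b1 and b3 are rough approximations of the reported magnitudes, and all names are assumptions.

```python
def predicted(b1, b2, b3, value, self_efficacy):
    """Full regression form: b1*V + b2*S + b3*V*S."""
    return b1 * value + b2 * self_efficacy + b3 * value * self_efficacy


def simple_slope_line(b1, b2, b3, self_efficacy):
    """Rearranged form: (intercept, slope) of the line in V at a fixed level of S."""
    intercept = b2 * self_efficacy
    slope = b1 + b3 * self_efficacy
    return intercept, slope


# Illustrative coefficients: B1 and B3 approximate the reported magnitudes,
# B2 is a purely hypothetical value chosen for the example.
B1, B2, B3 = 0.013, 0.10, 0.0145

for label, s in [("low (-1 SD)", -1.0), ("average", 0.0), ("high (+1 SD)", 1.0)]:
    intercept, slope = simple_slope_line(B1, B2, B3, s)
    print(f"{label}: quiz = {intercept:+.3f} + {slope:+.4f} * value")
```

Plotting the three lines over the observed range of value would reproduce the type of diagram described in the text, with the slope of each line equal to the conditional effect at that level of self-efficacy.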

Figure 8.12 depicts the effect of value at t1 on the subsequent quiz 1 for three different levels of self-efficacy. For average levels of self-efficacy, value is positively related to quiz performance, but the effect is only marginal, corresponding to the insignificance of the respective regression weight (see section 8.4.4). The effect for students with below-average self-efficacy even has a slightly negative tendency. However, the prior Johnson-Neyman plots revealed that these effects are insignificant, whereas the positive effect for above-average self-efficacy was significant. Hence, there seems to be no link between value and performance at low and average values of self-efficacy. Comparing


Figure 8.12 Diagram on the Effect of Value at t1 on Quiz 1 at Varying Self-Efficacy Levels

the slopes of each line reveals that the effect of value on quiz becomes more positive with higher levels of self-efficacy. In other words, the positive effect of value on the quiz score for students with exactly average self-efficacy becomes stronger for students with above-average self-efficacy. Relations of value to formative achievement are thus stronger for students with a higher self-efficacy. Conversely, students with low value appraisals are unlikely to receive a better quiz score even if their self-efficacy is above average. While S1 × V1 was significant only at the beginning of the semester, the interaction effects for interest were significant at two measurement occasions (SI1 → Q2 and SI2 → Q4). Analogous to the value construct, Figure 8.13 depicts the Johnson-Neyman diagram for SI1 → Q2. The green area in the diagram, where the confidence bands do not include zero, reveals that the interaction effect is only significant when self-efficacy is about 3.5 SD above its mean level, which is beyond the range of the empirical data¹⁷. Hence, contrary to the interaction effect of S1 × V1, the effect of interest on performance is insignificant for all levels of self-efficacy within the score range. The parameter estimates of the simple slopes for both interaction effects confirm that for self-efficacy levels of 1 SD below and above the mean, the interaction effects are

17 As mentioned earlier, the standard deviation of self-efficacy varies between .9 and 1.12 depending on the measurement occasion.


Figure 8.13 Johnson-Neyman Diagram for the Effect of Interest on Quiz with Increasing Self-Efficacy

insignificant, and the lower and upper C.I. include zero. Table 8.22 shows the parameters for both interaction effects.

Table 8.22 Effect of Interest at t1 on Quiz 2 and t5 on Quiz 4 for Varying Self-Efficacy Levels

| Level of Self-Efficacy | SI1 → Q2: ß | p | Lower 5% | Upper 5% | SI2 → Q4: ß | p | Lower 5% | Upper 5% |
|---|---|---|---|---|---|---|---|---|
| Low (−1 SD) | −.008 | .363 | −.023 | .007 | −.006 | .604 | −.024 | .013 |
| Average | .003 | .728 | −.010 | .016 | .005 | .629 | −.012 | .021 |
| High (+1 SD) | .014 | .174 | −.003 | .030 | .015 | .154 | −.002 | .033 |

As the interactions of S1 × I1 and S5 × I5 on quiz performance 2 and 4 are not meaningful across the relevant levels of self-efficacy, further diagrams are omitted. Instead, it will be checked whether the multiplicative effect of self-efficacy, interest, and value on learning-related enjoyment is significant and meaningful. Figure 8.14 depicts the Johnson-Neyman diagram of SV1 → JL3. In contrast to SV1 on quiz performance, the interaction effect on learning-related enjoyment is already relevant for students with below-average self-efficacy


Figure 8.14 Johnson-Neyman Diagram for the Effect of Value on Learning Enjoyment with Increasing Self-Efficacy

(approx. .6 SD below the mean). While the multiplicative effect of value and self-efficacy on quiz performance only applied to those with an above-average self-efficacy, the effect on learning-related enjoyment thus has a broader effective range. Based on the graphical results, the parameter estimates were tested at ± .5 SD to check the significance of the interaction (see Table 8.23).

Table 8.23 Effect of Value on Learning Enjoyment at Varying Self-Efficacy Levels

| Level of Self-Efficacy | SV1 → JL3: ß | p | Lower 5% | Upper 5% | SV2 → JL7: ß | p | Lower 5% | Upper 5% |
|---|---|---|---|---|---|---|---|---|
| Low (−.5 SD) | .124b | .011 | .044 | .204 | .005 | .912 | −.064 | .073 |
| Average | .187c | .000 | .108 | .265 | .032 | .427 | −.034 | .097 |
| High (+.5 SD) | .249c | .000 | .162 | .336 | .059 | .156 | −.009 | .127 |

For SV1 → JL3, the results suggest that even for students with a self-efficacy which is .5 SD below average, the interaction is still significant. In addition to the significant p-values, none of the confidence intervals includes the value of zero. It can thus be assumed that less efficacious students profit from the multiplicative relation on


their emotional state of mind rather than on their achievement. The effect for students with low self-efficacy becomes insignificant when tested at approx. 1 SD below the mean. For SV2 → JL7, however, the effect becomes insignificant across all three groups, suggesting that the multiplicative relation diminishes across the semester. The group-specific effects for SV1 → JL3 with variations of ± 1 SD in the level of self-efficacy are shown in Figure 8.15.

Figure 8.15 Diagram of the Effect of Value at t1 on Learning Enjoyment at t3 at Varying Levels of Self-Efficacy

The diagram underlines that the enjoyment-enhancing effect of value becomes stronger with higher levels of self-efficacy. Analogous to the Johnson-Neyman diagram, the effect is still slightly positive for students with low self-efficacy, even though it is insignificant at −1 SD. The multiplicative effect of SV2 → JL7 will not be detailed further because its mode of operation equals that of the prior one. However, the effect is only significant for students with a self-efficacy of approx. 1 SD above the mean. Finally, the self-efficacy-interest effect on learning-related enjoyment has to be checked. The diagrams strongly resemble those of the SV → JL relations and will therefore be omitted. The formal test results are shown in Table 8.24. Similar to SV1, the multiplicative effect of interest and self-efficacy is significant across all levels of self-efficacy. Considering the SD range from Table 8.24, the effective range for the interest construct is even broader than for value. More concretely, even students with a self-efficacy of 2 SD below the average


Table 8.24 Effect of Interest on Learning Enjoyment at Varying Self-Efficacy Levels

| Level of Self-Efficacy | SI1 → JL3: ß | p | Lower 5% | Upper 5% | SI2 → JL7: ß | p | Lower 5% | Upper 5% |
|---|---|---|---|---|---|---|---|---|
| Low (−1 SD) | .327c | .000 | .235 | .419 | .002 | .979 | −.106 | .109 |
| Average | .417c | .000 | .323 | .510 | .111a | .066 | .012 | .210 |
| High (+1 SD) | .506c | .000 | .379 | .633 | .220c | .002 | .105 | .335 |

self-efficacy profit from the effect of interest on learning-related enjoyment. The positive effect of interest on learning-related enjoyment comes into effect at lower levels of self-efficacy than the value effect. For SI2 → JL7, and similar to SV2, the positive effect of interest on enjoyment only comes into effect at a higher self-efficacy level. In contrast to value, where the multiplicative effect became significant only under high levels of self-efficacy (+1 SD), an average self-efficacy (0 SD) suffices for interest to positively impact enjoyment by the end of the semester. In sum, the effects of interest and value on quiz performance and learning-related enjoyment are maximized in regions of higher self-efficacy. Even though significant in the structural models, the multiplicative effect of interest and self-efficacy on quiz performance was minuscule across the continuum of self-efficacy, rendering it meaningful only under very unusual conditions. Conversely, the performance-enhancing effect of value at the beginning of the semester is limited to students in above-average score ranges of self-efficacy, where the interaction between self-efficacy and value meaningfully contributes to the additive model. Most multiplicative effects on quiz performance are likely to be negligible because the separate effects of the EV constructs had only been minor and inconsistent across the different models in section 8.3. The moderating effects on learning-related enjoyment were, however, stronger. Particularly at the beginning of the semester, even students with a below-average self-efficacy profit from the enjoyment-enhancing interest/value effect. More concretely, the multiplicative effect remains significantly positive even for students with below-average self-efficacy. The scope of the effects becomes more limited by the end of the semester, so that the strength of the positive relation between value and enjoyment still increases with higher levels of self-efficacy, but with no relationship at low and average levels of self-efficacy. For interest, the effect is more sustainable because only an average level of self-efficacy (0 SD) is needed to profit from


a positive effect of interest on enjoyment. The dependency on the level of self-efficacy could be the reason why the separate effects of interest and value on learning-related enjoyment become much less significant at t5. The significant multiplicative effects thus underline that the effects are still meaningful under a sufficient degree of self-efficacy. To conclude, there was only one meaningful multiplicative effect on quiz performance. However, the positive multiplicative effects on learning-related enjoyment might also be beneficial for quiz performance since enjoyment at t3 positively related to quiz performance. In general, the interaction effects become more meaningful with increasing levels of self-efficacy. Even though the moderator effects mostly function in accordance with the multiplicative association between expectancy and value, their low contribution to the r-square of the endogenous variables suggests that interpretations of the small effects are to be made with caution (see footnote 12). Another restriction of the latent interaction model involves the arbitrary stipulation of sample-dependent SD ranges, which may not be representative of the real world. Due to the lack of alternative reference frames and the sufficient sample size, it is still assumed that they approximate the population parameters. Moreover, sparser models needed to be computed due to computational demands. Even though they represent only smaller segments of the previously stipulated causation models, most effects were consistent between the interaction models and the separate models, so that they still remained sufficiently comparable.

9

Multiple Group Causal Analyses

9.1

Operationalization of the Grouping Variables

In a next step, multiple group analyses were conducted to investigate whether the reciprocal effects hold true across different groups, or subsamples, separated according to gender, prior achievement-related experience, and the teaching format. In the multiple group analyses, these observed, dichotomous variables are used to investigate the invariance of the structural relations between the postulated motivational, emotional, and cognitive variables. If there are path coefficients for which the invariance assumption does not hold, the grouping variable is assumed to exert a moderating effect on the respective relation. Gender (0 = female, 1 = male) is a variable without missing values as it was matched administratively with the anonymous code of the participants. The same applies to the variable teaching format (0 = TC, 1 = FC), which was generated according to the cohort of participation, i.e., 2017 or 2018. The variable math experience reflects students’ final math grade in school as a proxy for their prior achievement-related knowledge¹. The original variable had five categories ranging from a “very good” to an “inadequate” grade. For the multiple group analyses, the variable was dichotomized to have a manageable number of groups (0 =

1 Another possible approximation of prior achievement-related experiences could be whether students took an advanced math course at school. Compared to the math grade, however, this variable would have split the full sample into two groups of unequal size (one third with an advanced math course, two thirds without). Moreover, students with or without an advanced course do not differ significantly in their final math grade. This is why the final math grade is considered the more meaningful predictor, as it reflects the actual performance while the advanced course rather reflects a motivational decision at the beginning of the German upper school.

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_9

289

290

9

Multiple Group Causal Analyses

good performer, 1 = bad performer). The participants were split into two groups based on the median performance, so that a “good performer” had an A or B grade in school while a “bad performer” had no more than a C grade, or worse. This splitting of prior achievement also ensured that both group sizes are nearly in the ratio of 50 to 50 (716 good performers to 678 bad performers). Other empirical studies found support for factor loading invariance across gender for both EV appraisals (Bechrakis et al., 2011; Dauphinee, et al., 1997; Emmioglu et al., 2018; Hilton et al., 2004) and achievement emotions Davari et al., 2020) while invariance of prior knowledge and teaching context has not yet been investigated. Following the same procedure from the general causation models, before the comparison of the general latent mean change across the different groups throughout the semester, the assumption of scalar invariance will be tested.
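The dichotomization described above can be illustrated with a minimal sketch. It assumes that the five grade categories are coded 1 (“very good”) to 5 (“inadequate”) and that the cutoff lies between the second and third category; the function name, the coding, and the example grades are hypothetical and are not taken from the study data.

```python
from collections import Counter

GOOD, BAD = 0, 1  # coding used in the text: 0 = good performer, 1 = bad performer


def dichotomize_grade(grade: int, cutoff: int = 2) -> int:
    """Map a five-category grade (assumed coding: 1 = best, 5 = worst) to a binary group."""
    if grade not in (1, 2, 3, 4, 5):
        raise ValueError("grade must be an integer from 1 to 5")
    # Grades up to the cutoff count as good performance, all others as bad performance.
    return GOOD if grade <= cutoff else BAD


# Hypothetical grades for illustration only; these are not the study data.
example_grades = [1, 2, 2, 3, 3, 4, 2, 5, 1, 3]
example_groups = [dichotomize_grade(g) for g in example_grades]
group_sizes = Counter(example_groups)
print(group_sizes[GOOD], group_sizes[BAD])
```

A cutoff at the median category keeps both groups at a comparable size, which is the property the text relies on for the multiple group models.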

9.2 Testing Measurement Invariance Across Groups

Analogous to comparisons of means and path coefficients across time, measurement invariance across the investigated groups is an important prerequisite to conduct the analyses (Kleinke et al., 2017, p. 78). Such group differences might evoke different response patterns making it difficult to ensure that significant differences in factor structures, loadings, and intercepts do not stem from differences of the construct properties at the group levels. The required invariance level depends on the statistical measure to be compared according to the research questions. In order to compare factor means of the EV and achievement emotion constructs across the subsamples, scalar invariance of intercepts is necessary and the default setting in Mplus (Kleinke et al., 2017, p. 79). The baseline models for the latent mean comparisons across groups are the optimized separate models with their respective degree of time invariance established in section 8.2. As elaborated in section 5.4.2, the group comparisons for prior knowledge are based on one imputed data set for the variable math grade to avoid excluding more than 100 cases (depending on the respective models) completely from the analyses.² Apart from the auxiliary variables used in conjunction with the FIML procedure, other variables that might account for missing patterns were included in the imputation (such as final school grade, math advanced course, test anxiety). Based on these models, the goodness-of-fit of additional strong factorial invariance across groups for the measurement models is displayed in Table 9.1.

² For the multiple groups analyses, the data for both groups has to be in one dataset with the grouping variable and the observations per group must not deviate across the datasets. As multiple imputation generates several datasets with sometimes discrepant numbers of observations per group, single imputation was applied with Ndatasets = 1.

Table 9.1 Comparative Model Fit of the Baseline Model versus Group-Specific Strong Invariance

| Constructs | Model criterion | Baseline model (section 8.2) | Design | Gender | Math Grade [imputed] |
|---|---|---|---|---|---|
| S, D | χ² | 391.95 | 642.92 | 626.61 | 537.19 |
| | df | 125 | 262 | 262 | 262 |
| | RMSEA | .037 | .044 | .043 | .038 |
| | SRMR | .058 | .075 | .074 | .062 |
| | CFI/TLI | .947/.935 | .927/.914 | .928/.916 | .938/.927 |
| | Time invariance | Strong partial invariance of intercepts, except [s54], [d14] | | | |
| | Group invariance | | Strong invariance of intercepts | Strong invariance of intercepts | Strong invariance of intercepts |
| | #obs | 1523 | Trad = 728, FC = 795 | Female = 716, Male = 805 | good = 752, bad = 716 |
| I, V, A, E | χ² | 1,761.99c | 2,928.10 | 3,028.48 | 2,905.6 |
| | df | 637 | 1298 | 1298 | 1298 |
| | RMSEA | .034 | .041 | .042 | .041 |
| | SRMR | .049 | .075 | .073 | .067 |
| | CFI/TLI | .941/.931 | .913/.900 | .907/.894 | .908/.895 |
| | Time invariance | Strong partial invariance of intercepts, except [a13], [e53] | | | |
| | Group invariance | | Strong invariance of intercepts | Strong invariance of intercepts | Strong invariance of intercepts |
| | #obs | 1523 | Trad = 728, FC = 795 | Female = 716, Male = 805 | good = 762, bad = 708 |
| JC, HC | χ² | 1,953.24c | 3,031.67 | 3,099.09 | 2,81.70 |
| | df | 635 | 1299 | 1299 | 1299 |
| | RMSEA | .037 | .042 | .043 | .042 |
| | SRMR | .063 | .090 | .087 | .075 |
| | CFI/TLI | .939/.939 | .920/.922 | .915/.918 | .923/.925 |
| | Time invariance | Strict invariance of residual variances | | | |
| | Group invariance | | Strong invariance of intercepts | Strong invariance of intercepts | Strong invariance of intercepts |
| | #obs | 1536 | Trad = 741, FC = 795 | Female = 723, Male = 811 | good = 693, bad = 624 |
| JL, HL | χ² | 391.56c | 639.81 | 682.86 | 666.01 |
| | df | 118 | 258 | 258 | 258 |
| | RMSEA | .039 | .044 | .047 | .048 |
| | SRMR | .041 | .063 | .067 | .062 |
| | CFI/TLI | .975/.974 | .966/.968 | .960/.963 | .957/.960 |
| | Time invariance | Strict invariance of residual variances | | | |
| | Group invariance | | Strong invariance of intercepts | Strong invariance of intercepts | Strong invariance of intercepts |
| | #obs | 1524 | Trad = 729, FC = 795 | Female = 729, Male = 795 | good = 727, bad = 649 |

Concerning the fit criteria, the RMSEA fulfills the criteria of Chen (2007) as it does not increase by more than .015 compared to the baseline models. The RMSEA is also always smaller than the recommended cutoff of .05. The increase in SRMR does not exceed .030 and hence adheres to Chen’s recommended thresholds. However, for the course emotion model, it surpasses the acceptable cutoff of .08. Further modifications on grounds of the SRMR are not considered to avoid an overly sample-specific adaptation of the model and because the SRMR performs better for the structural models. This could stem from the fact that additional structural relations account better for the observed correlations. The CFI and TLI should not decrease by more than .020 from no to strong invariance (Chen, 2007). This threshold is mostly fulfilled, except for the value models and the gender-specific course emotion model. For all group-specific value models, the TLI also comes close to or even falls below the cutoff value of .9. It has to be kept in mind, however, that the value model comprises four constructs and four indicator-specific factors instead of separately analyzing each measurement model.³ The overview suggests that most fit criteria are still acceptable despite the additional strong restrictions for group and time invariance. Hence, further adaptations to the constructs to adhere better to the invariance assumptions in terms of model fit (i.e., letting error terms correlate or relieving invariance restrictions of certain loadings or intercepts) will not be made. There is also no considerably high modification index suggesting that a specific factor loading or intercept should be freed across groups. Improving the model would therefore require freeing several parameters. Accepting the slight loss of model fit also helps to avoid overly arbitrary, sample-specific modifications of the postulated model and to keep the factor structures comparable across groups in their entirety. Based on these models of strong factorial invariance across time and groups, the latent means for the different groups will be estimated and compared next.
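The threshold comparison described above can be retraced with a short sketch. It uses the change limits named in the text (.015 for RMSEA, .030 for SRMR, .020 for CFI) and fit values reported in Table 9.1; the function name and the dictionary layout are illustrative assumptions, and differences are rounded to three decimals because the published indices carry only three.

```python
def chen_check(baseline, constrained,
               max_rmsea_up=0.015, max_srmr_up=0.030, max_cfi_down=0.020):
    """Compare a constrained model with its baseline using the change limits from the text."""
    d_rmsea = round(constrained["rmsea"] - baseline["rmsea"], 3)
    d_srmr = round(constrained["srmr"] - baseline["srmr"], 3)
    d_cfi = round(baseline["cfi"] - constrained["cfi"], 3)
    return {
        "rmsea_ok": d_rmsea <= max_rmsea_up,
        "srmr_ok": d_srmr <= max_srmr_up,
        "cfi_ok": d_cfi <= max_cfi_down,
        # Absolute cutoff for the SRMR of the constrained model.
        "srmr_below_cutoff": constrained["srmr"] <= 0.08,
    }


# Expectancy constructs (S, D): baseline versus design-specific model.
expectancy = chen_check({"rmsea": .037, "srmr": .058, "cfi": .947},
                        {"rmsea": .044, "srmr": .075, "cfi": .927})

# Value constructs (I, V, A, E): baseline versus gender-specific model.
value_gender = chen_check({"rmsea": .034, "srmr": .049, "cfi": .941},
                          {"rmsea": .042, "srmr": .073, "cfi": .907})

# Course emotions (JC, HC): baseline versus design-specific model.
course_design = chen_check({"rmsea": .037, "srmr": .063, "cfi": .939},
                           {"rmsea": .042, "srmr": .090, "cfi": .920})

print(expectancy, value_gender, course_design, sep="\n")
```

Under these inputs the expectancy model passes all change criteria, the gender-specific value model fails the CFI criterion, and the design-specific course emotion model exceeds the absolute SRMR cutoff, which mirrors the exceptions named in the text.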

9.3 Group-specific Average Development Throughout the Semester

For the estimation of latent means across groups, Mplus fixes the mean of the first group at zero, so that the means of the group coded 1 are expressed relative to the group coded 0 (see section 5.2.2). Hence, Table 9.2 indicates the means of the second group in relation to the first group (i.e., FC compared to TC format, male to female, bad to good performers). Concerning the course design, the t1 differences are not yet attributable to differences in the course design as the EV appraisals had been assessed before the details on the different course designs were announced.⁴ Accordingly, it seems logical that no significant differences were found for self-efficacy and difficulty. For interest, value, affect, and effort, however, the students in the FC semester start with more beneficial manifestations than in the traditional course one year before. Throughout the flipped semester, however, students on average develop

³ Separate consideration of the value constructs reveals that the measurement model of effort has a worse fit than interest, value, and affect when strong group invariance is imposed (CFI/TLI < .90; SRMR/RMSEA > .08). The other three constructs fit well under the assumption of strong invariance across group and time. In order to maintain the comparability between groups, no further modifications of the effort construct will be considered.

⁴ Most cohort-specific effects regarding student characteristics have already been ruled out in section 5.4.1, so that the effects are assumed to stem from the implementation of the different designs.

Table 9.2 Average Development for Course Design, Gender, and Proficiency across the Semester

| Construct | Time | Mean difference (flipped) | 95% CI Low | 95% CI Up | Mean difference (male) | 95% CI Low | 95% CI Up | Mean difference (bad performer) | 95% CI Low | 95% CI Up |
|---|---|---|---|---|---|---|---|---|---|---|
| Self-Efficacy | t1 | .081 | −.013 | .176 | .425c | .333 | .518 | −.147c | −.240 | −.054 |
| | t5 | .080 | −.023 | .183 | .237c | .134 | .341 | −.275c | −.377 | −.172 |
| | t9 | −.122 | −.252 | .008 | .019 | −.105 | .019 | −.484c | −.612 | −.356 |
| Difficulty | t1 | .020 | −.083 | .123 | −.289c | −.390 | −.188 | .095 | −.007 | .196 |
| | t5 | −.022 | −.141 | .097 | −.292c | −.409 | −.175 | .288c | .112 | .344 |
| | t9 | .027 | −.129 | .183 | −.384c | −.528 | −.241 | .150a | .008 | .292 |
| Interest | t1 | .261c | .143 | .378 | .299c | .181 | .416 | .083 | −.035 | .201 |
| | t5 | −.068 | −.197 | .061 | −.144c | −.271 | −.018 | −.314c | −.442 | −.185 |
| | t9 | −.238b | −.397 | −.079 | −.314c | −.458 | −.169 | −.631c | −.784 | −.477 |
| Value | t1 | .386c | .234 | .537 | .388c | .237 | .539 | .212b | .060 | .365 |
| | t5 | −.039 | −.198 | .121 | −.110 | −.266 | .046 | −.262c | −.421 | −.102 |
| | t9 | −.388c | −.577 | −.198 | −.468c | −.641 | −.295 | −.635c | −.811 | −.459 |
| Positive Affect | t1 | .164a | .001 | .327 | .873c | .711 | 1.035 | −.392c | −.552 | −.231 |
| | t5 | .181a | .010 | .353 | .705c | .536 | .875 | −.453c | −.621 | −.284 |
| | t9 | −.256b | −.463 | −.048 | .458c | .269 | .646 | −.665c | −.851 | −.476 |
| Effort | t1 | .504c | .384 | .624 | −.184b | −.306 | −.061 | .493c | .370 | .616 |
| | t5 | −.610c | −.744 | −.476 | −1.343c | −1.476 | −1.210 | −.585c | −.725 | −.445 |
| | t9 | −.201b | −.371 | −.032 | −1.184c | −1.347 | −1.021 | −.367c | −.535 | −.200 |
| Course Enjoyment | t2 | .152a | .019 | .286 | .313c | .179 | .447 | .130a | .006 | .254 |
| | t4 | −.333c | −.477 | −.189 | −.120 | −.263 | .023 | −.484c | −.613 | −.356 |
| | t6 | −.124b | −.277 | .028 | .089 | −.061 | .239 | −.329c | −.464 | −.329 |
| | t9 | −.242b | −.406 | .078 | −.163 | −.334 | .007 | −.506c | −.658 | −.506 |
| Course Hopelessness | t2 | −.117 | −.237 | .004 | −.187b | −.315 | −.059 | −.023 | −.136 | .090 |
| | t4 | .355c | .217 | .492 | .230c | .093 | .367 | .403c | .279 | .526 |
| | t6 | .222b | .076 | .367 | .097 | −.044 | .237 | .343c | .209 | .478 |
| | t9 | .452c | .292 | .612 | .214c | .052 | .376 | .491c | .251 | .532 |
| Learning Enjoyment | t3 | .354c | .230 | .478 | .165b | .038 | .292 | −.122 | −.248 | .004 |
| | t7 | .246c | .117 | .375 | .104 | −.028 | .236 | −.225c | −.355 | −.094 |
| Learning Hopelessness | t3 | −.626c | −.753 | −.499 | −.360c | −.493 | −.226 | .082 | −.048 | .211 |
| | t7 | .229c | .086 | .372 | .381c | .234 | .528 | .872c | .730 | 1.013 |

more negative manifestations concerning these attitudes, which could be related to increasing uncertainties in the more openly structured design. The effect size of these differences, however, is small. Differences in the initial achievement emotions could already stem from different receptions of the course design as they were assessed beginning from t2, so that students were already familiar with the respective course structure. However, the average tendencies mostly correspond to those of the expectancy and value appraisals. Course emotions become more negative on average in the FC (less enjoyable, more frustrating). Learning-related hopelessness at t3, however, is significantly lower in the FC, whereas this reverses by the end of the semester, which may also stem from the increasing complexity of the statistical topics. Enjoyment outside the course is the only facet that is consistently higher in the flipped course. Design-related differences remain fairly constant throughout the semester, except for effort as well as hopelessness in and outside the class.

The effect sizes for gender and prior knowledge are in most cases higher than those of the course design. Gender-related differences in most cases either decreased or reversed, i.e., from favoring males at the beginning to favoring females at the end of the semester. Male students start the course with a higher self-efficacy as well as a lower appraisal of difficulty and anxiety (affect). The difference in self-efficacy decreases throughout the semester, while the differences in difficulty and affect persist. By the end of the semester, male students have a lower sense of interest and value, even though they had started with more favorable manifestations at the beginning of the semester. On average, males also report investing less effort in the statistics course than female students. The gap widens throughout the semester and is the largest across all groups and constructs. These average tendencies conform to the empirical state of research. Compared to expectancy and value, achievement emotions seem to be less affected by gender differences according to the smaller effect sizes. Male students start with higher enjoyment and less hopelessness. The differences in course and learning enjoyment become insignificant throughout the semester, while male students feel more frustrated in and outside the course at the end of the semester.

Bad performers on average have a lower self-efficacy and a higher difficulty appraisal throughout the semester, while the difference in difficulty is only small. Hence, bad performers seem at least to have an accurate self-awareness and do not overestimate their own capabilities. Bad performers also have more unfavorable manifestations on the other motivational and emotional constructs (i.e., less interest, value, enjoyment as well as more anxiety and hopelessness). On grounds of these negative experiences in achievement settings, they also seem to put less effort on average into studying statistics. Strikingly, the differences to the disadvantage of bad performers (i.e., self-efficacy, interest, value, affect, and achievement emotions) become more pronounced over the course of the semester. Considering that gender and prior experience are immutable characteristics, it is important to note that they seem to play a larger role in students’ heterogeneity than the course design. Taking into account the general tendencies of the group-specific average development, the separate EV models MC, MV, the course- and learning-related emotion models, and the multiplicative CV models of achievement emotion⁵ will be investigated regarding differential structural relations according to gender, prior knowledge, and design.
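How a latent mean difference and its 95% confidence interval relate to each other can be sketched as follows. The sketch assumes a symmetric, normal-theory interval (mean ± 1.96 standard errors), which is an assumption about how the intervals were constructed rather than a documented detail; the function name is illustrative. The two inputs are the design-related t1 differences for interest (.261, CI .143 to .378) and self-efficacy (.081, CI −.013 to .176) from Table 9.2.

```python
def ci_summary(mean, low, up, z_crit=1.96):
    """Recover the implied standard error and z-value from a symmetric 95% interval."""
    se = (up - low) / (2 * z_crit)  # half-width divided by the critical value
    return {
        "excludes_zero": low > 0 or up < 0,  # equivalent to significance at the 5% level
        "se": se,
        "z": mean / se,
    }


interest_t1 = ci_summary(0.261, 0.143, 0.378)
self_efficacy_t1 = ci_summary(0.081, -0.013, 0.176)

print(interest_t1)
print(self_efficacy_t1)
```

An interval that excludes zero corresponds to a group difference that is significant at the 5% level, which is how the design-related t1 difference in interest, but not in self-efficacy, qualifies as significant.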

9.4 Group-specific Reciprocal Causation Models

9.4.1 Goodness-of-fit of Each Group-Specific Model

For the comparison of relations between constructs within a nomological net, the weaker level of metric invariance (full or partial) suffices to ensure that the scale intervals are comparable across the respective groups and measurement occasions⁶ (Kleinke et al., 2017, p. 79; Steenkamp & Baumgartner, 1998, p. 82). Hence, before comparing the structural relations of the models, the goodness of fit will be considered with the assumption of weak factorial invariance across time and strong invariance across groups.⁷,⁸

⁵ The additive control-value models will not be considered in the multiple group analysis to minimize redundancies to the analyses of their multiplicative counterparts.

⁶ Scalar invariance for the comparison of factor relations across groups would be needed when absolute scale scores instead of latent factors are used for modelling. Other levels of invariance (i.e., to test for population heterogeneity with the invariance of factor (co-)variances) are outside the scope of the present research endeavor.

⁷ The Mplus default is to hold factor loadings and intercepts equal across groups (strong invariance). Reverting the restrictions to factor loading invariance across groups is done by repeating the model command for the second group, only including the intercepts. Models of weak invariance across time and strong invariance across groups on the one hand were compared to models of weak invariance across time and groups on the other. These comparisons showed that the latter models were only marginally better than those with the additional restriction of intercept invariance across groups, adhering to Chen’s cutoff criteria. Hence, and for better comparability of the groups, the models with intercept invariance across groups and weak invariance across time will be used and reported.

⁸ For reasons of brevity, the course- and learning-related models are summarized into one testing group.

Table 9.3 shows the goodness-of-fit for the structural models according to gender, design, and prior knowledge.

Table 9.3 Model Fit of the Group-Specific Models under Weak Factorial Invariance

| Model | Criterion | Original model | Gender | Design | Math Grade [imputed] |
|---|---|---|---|---|---|
| Expectancy model | χ² | 567.93c | 872.23 | 872.09 | 787.56 |
| | df | 200 | 414 | 414 | 414 |
| | SCF [%] | 13.4 | 9.7 | 1.0 | 12.14 |
| | RMSEA | .035 | .038 | .038 | .034 |
| | 90% C.I. | .031−.038 | .034−.041 | .034−.041 | .030−.038 |
| | SRMR | .056 | .071 | .071 | .055 |
| | CFI | .952 | .942 | .941 | .945 |
| | TLI | .939 | .929 | .928 | .933 |
| | # obs | 1538 | female = 724, male = 812 | trad = 743, flipped = 795 | good = 779, bad = 754 |
| Value model | χ² | 2,037.56 | 3,345.63 | 3,386.71 | 3,141.5 |
| | df | 779 | 1639 | 1639 | 1639 |
| | SCF [%] | 1.5 | 7.5 | 7.7 | 9.1 |
| | RMSEA | .032 | .037 | .037 | .035 |
| | 90% C.I. | .031−.034 | .035−.039 | .035−.039 | .033−.036 |
| | SRMR | .053 | .065 | .065 | .056 |
| | CFI | .942 | .920 | .919 | .926 |
| | TLI | .929 | .908 | .907 | .914 |
| | # obs | 1538 | female = 724, male = 812 | trad = 743, flipped = 795 | good = 792, bad = 741 |
| Course-related achievement emotions | χ² | 2,152.09c | 3,297.49 | 3,222.22 | 3,012.82 |
| | df | 762 | 1559 | 1559 | 1559 |
| | SCF [%] | 13.06 | 8.6 | 9.4 | 11.4 |
| | RMSEA | .034 | .038 | .037 | .035 |
| | 90% C.I. | .034−.036 | .036−.040 | .035−.039 | .033−.037 |
| | SRMR | .057 | .067 | .069 | .062 |
| | CFI | .943 | .929 | .933 | .935 |
| | TLI | .939 | .926 | .929 | .932 |
| | # obs | 1548 | female = 730, male = 816 | trad = 753, flipped = 795 | good = 780, bad = 722 |
| Learning-related achievement emotions | χ² | 618.49c | 845.10 | 828.87 | 828.54 |
| | df | 177 | 372 | 372 | 372 |
| | SCF [%] | 15.3 | 13.1 | 12.9 | 13.5 |
| | RMSEA | .042 | .044 | .043 | .043 |
| | 90% C.I. | .039−.046 | .040−.048 | .039−.047 | .039−.047 |
| | SRMR | .098 | .099 | .095 | .075 |
| | CFI | .967 | .964 | .965 | .961 |
| | TLI | .960 | .959 | .960 | .956 |
| | # obs | 1381 | female = 630, male = 700 | trad = 662, flipped = 669 | good = 697, bad = 629 |

Notes. df = degrees of freedom, SCF = scaling correction factor

The model fit indices mostly adhere to the specified cutoff ranges. For all group-related learning emotion models, the SRMR values are greater than .08, except for the imputed math grade model. The higher values might stem from the smaller number of measurement occasions and constructs included in the model. Moreover, the sample size for the learning-related models is smaller because they were assessed in an online survey, in which students might be less diligent in completing the assessment than in in-class contexts. Since the other fit indices suggest an appropriate fit, no further modifications will be applied.

9.4.2 Design-specific Reciprocal Causative Models

In the following, the different structural relations between expectancy, value, emotions, and the quiz performance will be compared across the three different group constellations under the previously described invariance conditions. The model structure for the multiple group analyses was kept identical to that of the overall models, i.e., all cross-lagged and autoregressive relations were controlled for. These relations will, however, not be discussed in order to keep the focus on the reciprocal quiz relations. In most cases, cross-lagged and autoregressive effects did not differ much across groups. The following models compare the reciprocal relations between AME appraisals and quiz performance across the two different cohorts of the TC and FC. Table 9.4 depicts the structural relations of the original overall model and the group-specific models in which the binary design variable functions as moderator. The p-value of the difference refers to the Wald test in Mplus, which was programmed to test the null hypothesis that the unstandardized regression weights in both groups are the same in the population. A significant p-value thus indicates that both regression weights significantly differ from each other.

Table 9.4 Design-Specific Structural Relations between Quiz and Expectancy Factors

| Self-Efficacy | S1 → Q1 | S1 → Q2 | Q1 → S5 | S5 → Q3 | S5 → Q4 | Q4 → S9 | S9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .026c | .014b | 1.050c | .019b | .001 | .695c | .012b |
| Traditional | .027b | .028c | .398a | .026b | .002 | .685c | .006 |
| Flipped | .025b | .003 | 1.592c | .014 | .001 | .642c | .019b |
| p of Δ | >.100 | .002 | .000 | .017 | >.100 | >.100 | >.100 |

| Difficulty | D1 → Q1 | D1 → Q2 | Q1 → D5 | D5 → Q3 | D5 → Q4 | Q4 → D9 | D9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .001 | −.009 | −.046 | .009 | −.006 | .113 | −.027c |
| Traditional | .007 | −.021b | −.043 | .023b | .008 | .168 | −.034c |
| Flipped | −.005 | .004 | −.030 | .003 | −.018b | .078 | −.021b |
| p of Δ | >.100 | .074 | >.100 | .000 | .000 | .017 | .000 |

Notes. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3; p of Δ is the p-value for the H0 that the regression weights are equal in both groups.
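The logic of testing whether a regression weight differs between two groups can be sketched with a simplified Wald statistic. The sketch assumes independent groups, so that the covariance between the two estimates is zero, and it treats the statistic as chi-square distributed with one degree of freedom; it does not reproduce the Mplus computation. The function name is illustrative, and the standard errors in the example are hypothetical because the tables report only coefficients.

```python
import math


def wald_equality_test(b1, se1, b2, se2):
    """Wald test of H0: b1 == b2 for estimates from two independent groups (1 df)."""
    variance = se1 ** 2 + se2 ** 2
    if variance <= 0:
        raise ValueError("at least one standard error must be positive")
    w = (b1 - b2) ** 2 / variance
    # Survival function of a chi-square distribution with one degree of freedom.
    p = math.erfc(math.sqrt(w / 2))
    return w, p


# Q1 -> S5 coefficients from Table 9.4 (TC = .398, FC = 1.592);
# the standard errors 0.18 and 0.20 are hypothetical placeholders.
w_stat, p_value = wald_equality_test(0.398, 0.18, 1.592, 0.20)
print(round(w_stat, 2), p_value)
```

With these placeholder standard errors the difference would be clearly significant; with larger standard errors the same pair of coefficients could fail to differ, which is why the coefficient gap alone does not determine the test result.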

The first effect of initial self-efficacy on quiz 1 does not differ significantly between the two designs, which might stem from the fact that the assessment of the baseline self-efficacy had taken place before students were informed about the differing course organization and requirements. Hence, students with an initial self-efficacy that is higher by one unit scored on average 2.7 percentage points better in the TC and 2.5 percentage points better in the FC, which is close to the original, overall effect. All subsequent effects of self-efficacy on quiz performance, however, are only significant in the TC design and significantly differ from the FC design according to the Wald test. The insignificant effects from prior self-efficacy stand in stark contrast to the impact of quiz 1 on subsequent self-efficacy in the FC design, which is significantly higher compared to the TC design. By the end of the semester, the path coefficients in both designs (Q4 → S9) reach a plateau of a medium-sized but still highly significant magnitude. A possible inference from these findings could be that in the TC design, the prior belief in one’s own capabilities is more decisive for subsequent performance. This also suggests that the self-efficacy belief in the TC might be more accurate than in the FC design, in which no significant relation was found as the semester progresses. It could be assumed that the FC leads to uncertain self-appraisals due to its more open structure, which would tie in with the large effect of quiz 1 on subsequent self-efficacy, while the reduction of the coefficient by the end of the semester could imply that students acclimatized themselves and their expectancies to the course design. More concretely, the quiz effect might be larger in the FC because the quizzes are the major source from which students receive binding feedback in the otherwise rather self-regulated format. Moreover, since students in the FC have to actively work out the statistical contents and solutions by themselves, they might be more likely to attribute performance-related success to their own abilities and skills. Balancing the lack of significant latent mean self-efficacy differences in both designs against the significantly higher impact of the first quiz on self-efficacy in the FC, it could be assumed that the average self-efficacy in the FC may even have been lower if it were not for the quiz.

In the FC, there is no significant follow-up effect from self-efficacy on the subsequent performance in quiz 3, which suggests that the prior self-efficacy-enhancing effect is more transient compared to the TC and that some potential of the high quiz effect on self-efficacy is wasted. At least, the reciprocal relation in the FC occurred again regarding the final exam. Students with a self-efficacy that is higher by one unit in the FC perform better by 1.9 percentage points in the final exam. It also has to be considered that the interpretation applies in the other direction, i.e., students with a lower self-efficacy also perform worse in the exam. Hence, even though with a delay, the FC seems to contribute to a more accurate self-appraisal by the end of the semester while the effect is insignificant for the TC design. The effect could be stronger in the FC because the higher involvement in the course renders students’ self-appraisal more accurate than in the TC, where the curricular content and even task solutions are spoon-fed to the students in a passive way of transmission.

Regarding the appraisal of statistics-related difficulty, the effects proceeding from the quizzes are insignificant in both designs and conform to the overall effects. The effects from difficulty on subsequent quiz scores vary across the designs while the overall relationships were insignificant. In the TC design, the prior difficulty appraisal tends to be more relevant. However, the coefficient for D1 → Q2 is significantly negative while D5 → Q3 is positive, which might result from different motivational mechanisms. At the beginning of the traditional semester, students with a higher difficulty appraisal score worse in the subsequent quiz, while students with a lower difficulty appraisal score better. In the midterm, the effect is reversed in such a way that students with a higher difficulty appraisal score better at quiz 3. Hence, at the beginning of the semester, students might be more motivated by easier tasks while more difficult, challenging tasks seem appropriate as the semester and experience proceed. This could also be an indication of the necessity of adaptive tasks. Comparable to the effects from self-efficacy on quiz, the difficulty effect on quiz in the FC only becomes relevant by the end of the semester. It could be that statistics-related difficulty carries hardly any weight during the flipped semester because students need more time to form an appropriate self-appraisal regarding future performance in the open structured design. The difficulty effect on the final exam is relevant in both designs. Here, the coefficient is once again negative, implying that students who perceive statistics to be easier score higher in the exam. The consistent negative effect of difficulty on the exam score might stem from the fact that the exam is high-stakes, making it less likely that students draw motivation from feeling overchallenged by the end of the semester. Finally, it has to be kept in mind that the coefficients lack a design-specific consistent pattern and that the latent means also do not differ significantly. Hence, the above-mentioned assumptions can only be considered first attempts to explain the slightly different trajectories between difficulty appraisals and quiz scores.

Transitioning from expectancy to value appraisals, Table 9.5 depicts the design-specific path coefficients. Starting with the interest and value appraisals, the effect of initial value on quiz 1 is only significant in the TC design while the effect of initial interest is significant in the FC design. These effects on quiz 1, however, occur only once at the beginning of the semester and are therefore not considered systematic. The quiz 1 and 4 effects on subsequent value and interest are higher in the FC, but not significantly so. Particularly the impact of quiz 1 on subsequent interest is twice as high in the FC design. By the end of the semester, the quiz effects decrease, but are still higher in the FC. Only the effect of quiz 4 on interest in the FC design remains barely significant by the end of the semester. The interest- and value-enhancing quiz impact might be higher on grounds of the intrinsic motivation which is rather addressed in the FC. Considering that the latent mean

Table 9.5 Design-Specific Structural Relations between Quiz and Value Factors

| Value | V1 → Q1 | V1 → Q2 | Q1 → V5 | V5 → Q3 | V5 → Q4 | Q4 → V9 | V9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .011a | .006 | .713c | .001 | −.004 | .369a | .011 |
| Traditional | .021b | .006 | .564a | −.006 | −.009 | .299 | .003 |
| Flipped | −.003 | .006 | .822b | .001 | −.009 | .495 | .001 |
| p of Δ | .023 | >.100 | >.100 | >.100 | >.100 | >.100 | >.100 |

| Interest | I1 → Q1 | I1 → Q2 | Q1 → I5 | I5 → Q3 | I5 → Q4 | Q4 → I9 | I9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .020b | .001 | .868c | −.004 | .003 | .458b | −.009 |
| Traditional | −.004 | −.009 | .508a | −.007 | .006 | .369 | −.004 |
| Flipped | .038c | .007 | 1.157c | −.009 | .005 | .536a | −.012 |
| p of Δ | .028 | >.100 | >.100 | >.100 | >.100 | >.100 | >.100 |

| Affect | A1 → Q1 | A1 → Q2 | Q1 → A5 | A5 → Q3 | A5 → Q4 | Q4 → A9 | A9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .008a | .005 | 1.240c | .010a | .011b | .085 | .012b |
| Traditional | .011a | .014c | .801b | .016b | .007 | .185 | .007 |
| Flipped | .006 | −.004 | 1.403c | .006 | .014b | .225 | .013b |
| p of Δ | >.100 | .021 | >.100 | >.100 | >.100 | >.100 | >.100 |

| Effort | E1 → Q1 | E1 → Q2 | Q1 → E5 | E5 → Q3 | E5 → Q4 | Q4 → E9 | E9 → E |
|---|---|---|---|---|---|---|---|
| Original effect | .009 | .008 | .835c | −.002 | | | |
| Traditional | .009 | .012 | 1.464c | | | | |
| Flipped | .009 | .006 | .705b | | | | |
| p of Δ | >.100 | >.100 | .061 | >.100 | >.100 | >.100 | >.100 |

Remaining Effort coefficients in their printed sequence (columns E5 → Q3 to E9 → E): .047c, .014, .738c, −.002, .054c, −.002, .783b, −.006, .007, 1.048c, .047c

Notes. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. p of Δ is the p-value for the H0 that the regression weights are equal in both groups.

differences for interest and value decrease throughout the semester compared to the TC, it is notable that students still seem to draw more intrinsic motivation from the quiz performance.

The pattern of the design-specific structural relations of affect on quiz is similar to that of self-efficacy and, at least partly, of difficulty. At the beginning of the traditional semester, students with a higher positive affect, i.e., students who feel less stressed and scared by statistics, score significantly better in the subsequent quizzes 1, 2, and 3. This effect becomes insignificant afterwards. For the flipped semester, this relation comes into effect only by the end of the semester, but then also impacts the final exam score, so that students with a more positive affect score better, and vice versa. It seems to be a more consistent pattern that some motivational preconditions are more influential in the first half of the traditional semester on the one hand, but more relevant by the end of the flipped semester on the other. Similar to self-efficacy, the impact of the first quiz on subsequent affect is higher in the FC, but not significantly so. This implies that quizzes in the FC seem to contribute more strongly to reducing stress and anxiety. Afterwards, affect also becomes a relevant predisposition for the subsequent quiz 4 and the final exam. This could be an indication of different timings of the motivational mechanisms in both designs for the affect and self-efficacy constructs. In the TC, the quizzes seem to have a more confirmative character for students reliant on their positive preconditions, i.e., positively minded students also score better in subsequent quizzes. In the FC, the quizzes seem to function more as an initially strong catalyst of subsequently more positive self-appraisals since the impact of the first quiz is always stronger than in the TC and initiates the reciprocity from then on up to the exam. Another reason for the delayed effect in the FC could be the innovative format resulting in uncertainties to which students have to adjust themselves first.

The effects of the effort construct are fairly stable in both designs compared to the original effects. The effect of quiz 1 on midterm effort, however, is significantly smaller in the FC, thus being the only quiz effect that seems to be more beneficial for students of the TC. A possible reason for this discrepancy might be that the mandatory quizzes at the beginning of the semester were the first task that required students to become active in the otherwise more passive teaching format, while in the FC, students needed to invest effort right from the start to keep pace with the syllabus. Hence, students in the FC might perceive the additional effort to work on the quizzes to be lower as they were already more actively engaged in the course. This coincides with the latent mean difference, indicating that, by the midterm, students feel less stressed in the FC on average. The similar effect from quiz 4 to subsequent effort also suggests that by the end of the semester, the effort-enhancing quiz effect levels off at a medium magnitude in both designs.

Continuing with course emotions, Table 9.6 depicts the structural relations in both designs. The reciprocal relations of course enjoyment are consistently only relevant and significant in the FC while only the initial precondition is significant in the TC

.087

ρ of 

−.674b −.906c

−.020

−.049c

.084

Traditional

Flipped

ρ of  >.100

>.100

−.143

−.771c

−.477b

Q2 → Hc 6

>.100

−.005

−.003

−.007

Hc 6 → Q4

>.100

.000

.002

−.001

Jc 6 → Q4

>.100

−.786c

−.515b

−.606c

Q4 → Hc 9

>.100

.749c

.249

.434c

Q4 → Jc 9

>.100

−.024c

−.016b

−.021c

Hc 9 → E

>.100

−.014b

−.007

−.010b

Jc 9 → E

Notes. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. ρ of  is the p-value for the H0 that both regression weights are equal in both groups.

>.100

−.023c

−.706c

>.100

−.084

.153

.020

Q2 → Jc 6

−.005

−.014b

Q1 → Hc 4

Original effect

−.037c

Hc 4 → Q2

.004

.032c

.004

Hc 2 → Q1

Course Hopelessness

>.100

.174

Traditional .583a

.004

−.003

.018c

.472b

Flipped

Jc 4 → Q2

Q1 → Jc 4

Course Enjoyment

Jc 2 → Q1

.018a

Original effect

Table 9.6 Design-Specific Structural Relations between Quiz and Course Emotions

design. Hence, most of the overall, original effects only apply to the flipped format. It can be assumed that the action-oriented, communicative, and constructivist flipped course design is more appreciated because the contents and skills are more easily internalized and can be applied in the subsequent quizzes, which always drew on the in-class content learnt previously. Despite the higher relevance of course enjoyment in the FC, the latent mean comparison shows that it is significantly lower on average than in the TC. For course hopelessness, no consistently different patterns could be distinguished between the two designs. Significant reciprocal effects exist in both designs and, apart from minor, unsystematic deviations, the reciprocal pattern is fairly consistent and the magnitude of the coefficients is similar, so that hopelessness seems to be a relevant influencing factor in both designs. However, it has to be kept in mind that course hopelessness is, on average, significantly higher in the FC with a medium effect size. Finally, Table 9.7 depicts differences in the learning-related emotion context.

Table 9.7 Design-Specific Structural Relations between Quiz and Learning Emotions

Learning Enjoyment     JL 3 → Q2   JL 3 → Q3   Q2 → JL 7   JL 7 → Q4   JL 7 → E
Original effect        .017c       .012b       .082        .004        −.008a
Traditional            .012b       −.004       −.080       .001        −.010a
Flipped                .025c       .025c       .318        .008        −.005
ρ                      >.100       .024        >.100       >.100       >.100

Learning Hopelessness  HL 3 → Q2   HL 3 → Q3   Q2 → HL 7   HL 7 → Q4   HL 7 → E
Original effect        −.006       −.011a      −1.395c     −.003       −.014c
Traditional            −.016b      −.025b      −1.125c     −.005       −.015c
Flipped                .001        .003        −1.873c     −.004       −.012b
ρ                      >.100       .043        .087        >.100       >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

Comparable to the course context, learning-related enjoyment seems to be more relevant in the FC. The impact of initial learning enjoyment on subsequent quiz performance is more consistent in the FC design. In the TC design, the effect is only significant for JL 3 on quiz 2, and the effect on quiz 3 is significantly smaller than in the flipped design. The effect of quiz 2 on JL 7 is insignificant in both designs. Hence, learning-related enjoyment seems to be more performance-relevant in the
FC at least until the midterm, which may stem from the fact that students were offered additional learning materials to organize their off-campus workload, such as educational videos and further online material. On average, according to the latent means, students also seem to enjoy learning outside the class more in the FC throughout the semester. By contrast, prior out-of-class hopelessness seems to be a more important performance factor in the TC. Students with lower hopelessness score better in the quizzes. This effect is significant until the middle of the semester. The hopelessness-reducing quiz effect, however, is significantly higher in the FC (but only at the .10 level), while the relation between hopelessness and exam does not differ significantly between the designs. It could be that initial hopelessness is less relevant in the flipped course because students can learn at their own preferred pace from the start in a more autonomous learning environment, so that they feel neither overtaxed nor unchallenged. This also comes across in the latent mean difference, according to which students in the FC feel less frustrated in the first half of the flipped semester. The quiz impact on subsequent hopelessness might be higher in the FC because the quiz is considered a more beneficial indication of the current knowledge level in the more open FC design, comparable to the quiz impact on self-efficacy. In sum, course- and learning-related hopelessness seem to be a relevant learning factor in both the FC and the TC. This stands in contrast to the course- and learning-related quiz-enjoyment interrelations, which are consistently more prominent in the FC design.
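
The design comparison for learning emotions comes down to scanning the equality-test row of Table 9.7 for values below the chosen level. The following is a minimal sketch of that reading, with the weights and p-values transcribed from Table 9.7. The dictionary layout, the ASCII path names, and the function name are illustrative assumptions, and entries printed as >.100 are stored as None because only the bound is reported.

```python
# Group-specific weights and equality-test p-values transcribed from Table 9.7.
# "TC" = traditional course, "FC" = flipped course.
# p is None where the table only reports ">.100".
TABLE_9_7 = {
    "JL3->Q2": {"TC": 0.012, "FC": 0.025, "p": None},
    "JL3->Q3": {"TC": -0.004, "FC": 0.025, "p": 0.024},
    "Q2->JL7": {"TC": -0.080, "FC": 0.318, "p": None},
    "JL7->Q4": {"TC": 0.001, "FC": 0.008, "p": None},
    "JL7->E": {"TC": -0.010, "FC": -0.005, "p": None},
    "HL3->Q2": {"TC": -0.016, "FC": 0.001, "p": None},
    "HL3->Q3": {"TC": -0.025, "FC": 0.003, "p": 0.043},
    "Q2->HL7": {"TC": -1.125, "FC": -1.873, "p": 0.087},
    "HL7->Q4": {"TC": -0.005, "FC": -0.004, "p": None},
    "HL7->E": {"TC": -0.015, "FC": -0.012, "p": None},
}


def differing_paths(table, alpha=0.10):
    """Return the paths whose TC and FC weights differ at level alpha."""
    return [
        path
        for path, row in table.items()
        if row["p"] is not None and row["p"] < alpha
    ]


if __name__ == "__main__":
    print(differing_paths(TABLE_9_7))        # paths differing at the .10 level
    print(differing_paths(TABLE_9_7, 0.05))  # stricter .05 level
```

At the .10 level used in the text, three of the ten paths are flagged (JL 3 → Q3, HL 3 → Q3, and Q2 → HL 7); at the .05 level, the quiz effect on hopelessness (Q2 → HL 7) drops out, which matches the caveat "but only at the .10 level".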

9.4.3 Gender-specific Reciprocal Causative Models

Table 9.8 depicts the reciprocal relations for the expectancy model, beginning with the original effect from the overall model (sections 8.3.2 and 8.3) and followed by the effects in the female and male groups.

Table 9.8 Gender-Specific Structural Relations between Quiz and Expectancy Factors

Self-Efficacy    S1 → Q1   S1 → Q2   Q1 → S5   S5 → Q3   S5 → Q4   Q4 → S9   S9 → E
Original effect  .026c     .014b     1.050c    .019b     .001      .695c     .012b
Female           .038c     .003      .708c     .030b     .010      .727c     .007
Male             .016      .027c     1.332c    .018      −.006     .689c     .019b
ρ                >.100     .028      .082      >.100     >.100     >.100     >.100

Difficulty       D1 → Q1   D1 → Q2   Q1 → D5   D5 → Q3   D5 → Q4   Q4 → D9   D9 → E
Original effect  .001      −.009     −.046     .009      −.006     .113      −.027c
Female           .003      −.002     −.459a    .029b     −.009     .255      −.016a
Male             −.002     −.013     .318      −.015     −.007     −.145     −.038c
ρ                >.100     >.100     .050      .005      >.100     >.100     .070

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

Starting with self-efficacy, the relationship between quiz and subsequent self-efficacy is highly significant for both groups on both occasions (Q1 → S5, Q4 → S9). However, the coefficient for Q1 → S5 is almost twice as strong for male students, with a significant gender-specific difference at the .10 level. The effect for males decreases to the level of the female coefficient for Q4 → S9, so that there is no longer a significant difference between the groups. For male students, however, the bigger effect of Q1 on S5 does not translate into a significant reciprocal effect on subsequent quiz scores. Hence, after quiz 1, male students with a higher self-efficacy do not seem to achieve significantly higher scores in quizzes 3 and 4. This suggests that the initially higher self-efficacy-enhancing quiz effect partly stems from a nascent overconfidence because their higher self-efficacy does not lead to better performance afterwards. This assumption is also underscored by the fact that the effect of Q4 on S9 for male students aligns with the more moderate effect for female students, which is also similar to the first coefficient in the female group (.708). This also implies that female students might be better able to process the feedback effectively and that they profit more from the reciprocal effect between quiz performance and self-efficacy. The latent mean comparison also revealed that male students start with a higher average self-confidence than female students, but the difference became insignificant by the end of the semester. Running through quizzes 2 to 4 might have helped male students to adapt their expectancies to a more accurate level by the end of the semester. This also comes across in the significant impact of self-efficacy at t9 on the final exam score for male students, implying that the exam score increases on average by 1.9 percentage points when self-efficacy increases by one unit.

The insignificant original effects between the difficulty construct and quiz performance show almost no variation when gender differences are considered. For female students, there is one reciprocal relationship at the midterm in such a way that better quiz performance makes them perceive subsequent statistical tasks to be easier (Q1 → D5). Subsequently, the positive regression weight indicates that
female students with a higher prior difficulty appraisal have a higher score in the subsequent quiz (D5 → Q3). These vague patterns might suggest that female students a priori take tasks more seriously, which ties in with the findings for self-efficacy at the midterm, showing that male students tend to overestimate their own capabilities or underestimate the difficulty of the statistical task. Hence, the positive coefficient Q1 → D5 for male students suggests that, after performing well in the quiz, they revised their difficulty appraisal and assessed statistical tasks to be more difficult.⁹ The final effect of difficulty on the exam is significant in both groups, but stronger for male students, so that they profit more from the performance-enhancing effect of lower difficulty appraisals. As the impact of prior self-efficacy on the exam was also smaller and insignificant for female students, it could be assumed that prior expectancy does not necessarily translate into better performance in high-stakes contexts for female students. In other words, the exam characteristics might place female students at a disadvantage compared to male students. Even though the group-related difficulty relationships might suggest the existence of gender-specific coping mechanisms, it has to be borne in mind that the coefficients are mostly small and insignificant across both groups. The consistent gender-related latent mean differences in the difficulty appraisal until the end of the semester (see section 9.3) might also indicate that the quiz had a limited impact on reducing prevalent gender gaps, even though these differences were only small according to the effect sizes.

In the next step, Table 9.9 depicts gender-related differences in the value causation model. Starting with the relationships between the value and interest constructs and the quiz score, initial value only has a positive impact on Q1 for female students and initial interest on Q1 for male students. These effects, however, are isolated. When it comes to the quiz effects on subsequent value, male students benefit significantly more than female students (p = .022). The coefficient decreases by the end of the semester for male students, but remains significant, along with the significant gender difference (p = .023). The quiz effect on interest does not differ significantly between the groups (Q1 → I5), but for male students, the effect remains significant at the end of the semester. For both interest and value, the initial impact on subsequent quizzes is mostly insignificant and does not vary systematically across the groups. Strikingly, even though male students have a lower average interest and value appraisal by the end of the semester according

⁹ In the subsequent section 9.4.2, another explanation for the change of algebraic signs in the relations D → Q is attempted, assuming that easy tasks are more beneficial at the beginning of the semester, and more challenging tasks by the midterm.

Table 9.9 Gender-Specific Structural Relations between Quiz and Value Factors

Value            V1 → Q1   V1 → Q2   Q1 → V5   V5 → Q3   V5 → Q4   Q4 → V9   V9 → E
Original effect  .006      .011a     .713c     .001      −.004     .369a     .006
Female           .021b     .011      .145      −.006     −.009     −.203     −.012
Male             .003      .014      1.161c    .010      .001      .788c     .008
ρ                .023      >.100     .022      >.100     >.100     .023      >.100

Interest         I1 → Q1   I1 → Q2   Q1 → I5   I5 → Q3   I5 → Q4   Q4 → I9   I9 → E
Original effect  .020b     .001      .868c     −.004     .003      .458b     −.009
Female           .005      −.005     .933c     .008      .022b     .291      −.013
Male             .039b     .005      .911c     −.014     −.021     .696c     .002
ρ                .081      >.100     >.100     .026      >.100     >.100     >.100

Affect           A1 → Q1   A1 → Q2   Q1 → A5   A5 → Q3   A5 → Q4   Q4 → A9   A9 → E
Original effect  .008a     .005      1.240c    .010a     .011b     .085      .012b
Female           .010      .005      .880c     .005      .007      .008      .013a
Male             .007      .006      1.621c    .019b     .021c     .210      .007
ρ                >.100     >.100     >.100     >.100     >.100     >.100     >.100

Effort           E1 → Q1   E1 → Q2   Q1 → E5   E5 → Q3   E5 → Q4   Q4 → E9   E9 → E
Original effect  .009      .007      1.048c    .047c     .008      .835c     −.002
Female           −.001     .01       1.364c    .052c     −.002     .836c     .001
Male             .012      .004      .715b     .033b     .018      .766c     −.010
ρ                >.100     >.100     .093      >.100     >.100     >.100     >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

to the latent mean comparisons, they seem to draw more benefit from the quiz effects on these respective constructs. For the affect construct, a better quiz score results in a more positive affect at the midterm for both male and female students. For male students, however, the effect is approximately twice as high. Male students also reciprocally profit from a positive affective appraisal resulting
in higher scores on the subsequent quizzes 3 and 4. These effects are insignificant for female students, while they profit from positive affect regarding the exam score. Finally, the gender-related effects between effort and quiz mostly conform to the original effects (i.e., Q1 → E5, E5 → Q3, Q4 → E9). Whereas the third effect does not differ significantly between the groups, the first two effects differ in magnitude. Female students who achieve a higher score on the quiz are significantly more motivated to invest more effort into subsequent statistical tasks compared to male students. The following reciprocal relationship is also stronger for female students (but not significantly) in such a way that higher effort translates into a higher subsequent quiz score compared to male students. The effect at the end of the semester is still greater for female students, but the difference is no longer significant. It could be that female students adapt their level of effort based on their prior success and thus appraise the quiz to be less taxing by the end of the semester. The consistently higher latent means for female students on the effort construct by the end of the semester underline the consistent importance of these gender-related differences on average, despite the attenuated effort-enhancing quiz impact. At the end of the semester, effort does not affect the exam score, which might be due to the fact that the processing conditions of the quizzes favor diligent students because of the lack of time restrictions.

The next models compare gender-specific effects between quiz and achievement emotion appraisals. Table 9.10 depicts the reciprocal effects for course emotions. For course enjoyment, the gender-related effects mostly conform to the original effects. The effect of quiz 1 on subsequent enjoyment is significant for female students only, but the effect of quiz 4 on subsequent enjoyment does not differ significantly between the groups. Hence, neither group consistently profits more from the quiz effect. The more favorable first impact on enjoyment for female students coincides with the fact that the latent mean difference, which was in favor of male students at the beginning of the semester, became insignificant at t4 and all subsequent measurement occasions. The counterintuitive negative original effect of enjoyment on the final exam also occurs in both groups.¹⁰ Course hopelessness reveals a more consistent gender-specific pattern. Female students profit more consistently from the hopelessness-reducing quiz effect, which remains significant until the end of the semester. By contrast, the effect for males is only

¹⁰ As this effect also applies to learning-related enjoyment and to all other group comparisons, it will not be repeated in the subsequent elaborations. It will be scrutinized in the discussion section.

.004

>.100

Male

ρ of  >.100

.359

−.706c −.651b −.809c

−.032b

−.039c

>.100

Female

Male

ρ of  >.100

−.022c

−.005

−.014b

Hc 4 → Q2

.057

−.076

−.905c

−.477b

Q2 → Hc 6

>.100

>.100

−.018

.003

−.007

Hc 6 → Q4

>.100

.074

−.379

−.970c

−.606c

Q4 → Hc 9

>.100

−.005

.138

.017b >.100

.449b

.005

−.026

>.100

−.014a

−.031c

−.021c

Hc 9 → E

>.100

−.012b

−.015c

−.010b

Jc 9 → E

Notes. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. ρ of  is the p-value for the H0 that both regression weights are equal in both groups.

>.100

Q1 → Hc 4

−.037c

Original effect

Hc 2 → Q1

Course Hopelessness

.003

.434c

Q4 → Jc 9 .445b

Jc 6 → Q4 −.001

.564b

.004

Q2 → Jc 6 .020

.018c

.472b .021c

Jc 4 → Q2

Q1 → Jc 4

Course Enjoyment

Jc 2 → Q1

Female

Original effect

Table 9.10 Gender-Specific Structural Relations between Quiz and Course Emotions

significant at the beginning of the semester. Strikingly, the gender-related gap in the latent means for hopelessness also changes from favoring males to favoring females by the midterm and remains significant until the end of the semester. Prior hopelessness also seems to be more relevant for female students in terms of exam performance; a decrease of one scale unit of hopelessness goes along with an increase in the exam score of 3.1 percentage points for female students and of 1.4 percentage points for male students. Finally, Table 9.11 shows gender-related effects of learning emotions.

Table 9.11 Gender-Specific Structural Relations between Quiz and Learning Emotions

Learning Enjoyment     JL 3 → Q2   JL 3 → Q3   Q2 → JL 7   JL 7 → Q4   JL 7 → E
Original effect        .017c       .012b       .082        .004        −.008a
Female                 .021c       .004        −.300       .010        −.011b
Male                   .018b       .020b       .281        −.002       −.008
ρ                      >.100       >.100       >.100       >.100       >.100

Learning Hopelessness  HL 3 → Q2   HL 3 → Q3   Q2 → HL 7   HL 7 → Q4   HL 7 → E
Original effect        −.006       −.011a      −1.395c     −.003       −.014c
Female                 −.002       −.023b      −1.146c     −.002       −.020c
Male                   −.007       .000        −1.639c     −.007       −.007
ρ                      >.100       .091        >.100       >.100       >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

The effect of quiz 2 on subsequent learning enjoyment is insignificant, just as in the original model. Learning enjoyment at the beginning of the semester is relevant for both groups, but the second effect (JL 3 → Q3) is only significant for male students. The latent means also underline that the gender-related differences in learning enjoyment are negligible in magnitude and remain stable throughout the semester. Learning-related hopelessness is more consistently reciprocal for female students than for male students. While the hopelessness-reducing effect of quiz 2 on subsequent hopelessness is slightly smaller for female students, they seem to draw more strongly on their prior hopelessness appraisal than male students (p = .091), which applies to the effects on quiz 3 and on the final exam score. The lack of a significant effect of hopelessness on quiz performance suggests that more or less hopeless male students did not perform worse
or better, respectively. This once again leads to the assumption that for male students, the interrelations between appraisals and performance are not as consistent and accurate as for female students. This incongruity was already suggested regarding the expectancy appraisals and also comes across in the fact that the latent mean gender difference of many constructs either decreases (i.e., self-efficacy, affect, course and learning enjoyment) or changes from favoring males at the beginning of the semester to favoring females by the end (i.e., interest, value, effort, and course and learning hopelessness). Unlike the effects of self-efficacy and difficulty, the impact of hopelessness on the subsequent exam score is higher for female students in both the course- and the learning-related context. This underscores that female students seem to be guided more by their emotions than by their self-expectancies in exam situations.
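
The percentage-point readings quoted in this section (for example, 1.9 points for S9 → E in the male group, or 3.1 and 1.4 points for Hc 9 → E) follow from multiplying the unstandardized weight by the change in the predictor. The following is a small sketch of that arithmetic. It assumes, as the text's own conversion implies, that the exam score is scaled as a proportion between 0 and 1; the function and variable names are hypothetical.

```python
def exam_change_pp(weight, delta_units):
    """Expected change of the exam score in percentage points.

    weight      -- unstandardized regression weight on the exam score
                   (assumed to refer to an exam score scaled from 0 to 1)
    delta_units -- change of the predictor in scale units
    """
    return round(weight * delta_units * 100, 1)


# Weights of course hopelessness at t9 on the exam (Hc 9 -> E) as quoted
# in the text for the two gender groups.
HC9_TO_EXAM = {"female": -0.031, "male": -0.014}

if __name__ == "__main__":
    for group, weight in HC9_TO_EXAM.items():
        # A decrease of hopelessness by one scale unit (delta = -1).
        print(group, exam_change_pp(weight, -1))
    # Self-efficacy at t9 on the exam for male students (S9 -> E).
    print("male S9 -> E", exam_change_pp(0.019, 1))
```

Because the weights are unstandardized, the comparison between groups is only meaningful if both groups use the same scale for the predictor, which is the case here since the same instruments were applied to all students.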

9.4.4 Expertise-specific Reciprocal Causative Models

The final criterion for comparing the structural relations is the subject-related expertise level of the students according to their final math grade. Table 9.12 compares the structural relations for expectancy and difficulty between students with higher and lower prior knowledge (“good” vs. “bad”).

Table 9.12 Expertise-Specific Structural Relations between Quiz and Expectancy Factors

Self-Efficacy    S1 → Q1   S1 → Q2   Q1 → S5   S5 → Q3   S5 → Q4   Q4 → S9   S9 → E
Original effect  .026c     .014b     1.050c    .019b     .001      .695c     .012b
Good             .023b     .020b     1.127c    .026b     .004      .728c     .004
Bad              .019      .005      .950c     .014      −.004     .542b     .019b
ρ                >.100     >.100     >.100     >.100     >.100     >.100     >.100

Difficulty       D1 → Q1   D1 → Q2   Q1 → D5   D5 → Q3   D5 → Q4   Q4 → D9   D9 → E
Original effect  .001      −.009     −.046     .009      −.006     .113      −.027c
Good             .002      −.015     −.308     .012      −.010     .177      −.020c
Bad              .005      −.001     −.005     .012      .004      −.025     −.032c
ρ                >.100     >.100     >.100     >.100     >.100     >.100     >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

Fortunately, both more and less proficient students sustainably profit from the self-efficacy-enhancing impact of quiz 1 and quiz 4 (Q1 → S5 and Q4 → S9), as the respective coefficients are not significantly different from each other. This suggests that the quiz feedback is conducive to students irrespective of their prior level of expertise. However, only more proficient students seem to consistently benefit from their higher prior self-efficacy, which suggests that they have a more accurate self-efficacy that translates into better performance, while less proficient students do not seem to have sufficient mastery to draw beneficially on their confidence. These mechanisms suggest that the quiz has a more confirmative function for more proficient students. On the one hand, such individuals with a higher self-efficacy also achieve better quiz scores and vice versa. On the other hand, less-versed students are seemingly not able to draw a performance-enhancing motivation from prior self-efficacy. At least, the impact of prior self-efficacy for less proficient students seems to become relevant before the final exam, implying that the quiz feedback might have helped them to assess their own capabilities more accurately. Regarding the difficulty construct, there are mostly no expertise-related differences, and the group-specific coefficients conform to the original overall effects. Only the impact of difficulty on exam performance is higher for less proficient students. Hence, less proficient students profit more strongly if they perceive a task to be easy. This relation also underlines that, by the end of the semester, less versed students likely succeeded in building up an accurate self-appraisal of their abilities in coping with statistical tasks.

Table 9.13 continues with the comparison of the value-related constructs for more and less proficient students.

Table 9.13 Expertise-Specific Structural Relations between Quiz and Value Factors

Value            V1 → Q1   V1 → Q2   Q1 → V5   V5 → Q3   V5 → Q4   Q4 → V9   V9 → E
Original effect  .006      .011a     .713c     .001      −.004     .369a     .006
Good             .005      .002      1.131c    .006      .002      .428      .000
Bad              .003      .021b     .255      −.004     −.014     .142      .017
ρ                >.100     .081      .052      >.100     >.100     >.100     >.100

Interest         I1 → Q1   I1 → Q2   Q1 → I5   I5 → Q3   I5 → Q4   Q4 → I9   I9 → E
Original effect  .020b     .001      .868c     −.004     .003      .458b     −.009
Good             .012      .011      1.102c    −.004     −.015     .571b     −.003
Bad              .027a     −.009     .474a     −.006     .026a     .191      −.021
ρ                >.100     >.100     .032      >.100     >.100     >.100     >.100

Affect           A1 → Q1   A1 → Q2   Q1 → A5   A5 → Q3   A5 → Q4   Q4 → A9   A9 → E
Original effect  .008a     .005      1.240c    .010a     .011b     .085      .012b
Good             .004      .004      1.246c    .011      .014b     .380      .005
Bad              .006      .002      1.123c    .009      .004      −.161     .012c
ρ                >.100     >.100     >.100     >.100     >.100     >.100     >.100

Effort           E1 → Q1   E1 → Q2   Q1 → E5   E5 → Q3   E5 → Q4   Q4 → E9   E9 → E
Original effect  .009      .007      1.048c    .047c     .008      .835c     −.002
Good             .014      .004      .764c     .048c     .009      .662b     −.006
Bad              .004      .012      1.014c    .046c     .009      .818b     .006
ρ                >.100     >.100     >.100     >.100     >.100     >.100     >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

For the value and interest constructs, the most striking difference is that the impact of quiz 1 on subsequent appraisals is higher for more proficient students (and significantly higher only for Q1 → V5). While the impact of quiz 1 on interest is at least weakly significant, the impact on value is not significant for less proficient students. The impact of quiz 4 on end-of-semester value and interest is also higher for more knowledgeable students, but became insignificant for the value construct, which may be due to the reduced group sample. This finding is aggravated by the fact that the latent mean differences for interest and value also deteriorate as the semester proceeds, suggesting a growing divergence regarding value appraisals to the detriment of less proficient students. It could be assumed that more proficient students profit more from the interest- and value-enhancing quiz effect because they are more acquainted with the subject matter and know better how to process the feedback from the quiz in personally favorable ways. Moreover, value- and interest-related preconditions tend to be more relevant for less proficient students. It could be that less proficient students base their preceding performance-related appraisals on aspects belonging to the subject itself, in this case its value and interestingness, rather than on their supposedly lower expertise or self-belief (cf. the insignificant impact of prior self-efficacy on quiz performance for less proficient students). The affect and effort structure is mostly stable across both groups. The only notable difference for effort is that Q1 → E5 and Q4 → E9 are higher for less proficient students.
This could be because, under otherwise similar conditions, less proficient students might feel more strained, as they had to catch up on their lower level of prior knowledge to perform better in the quizzes. Despite the similar effects for affect and effort, the latent mean difference for affect becomes significantly worse for less proficient students by the end of the semester. Regarding effort, they felt, on average, less motivated to invest more effort than more proficient students. This conflicts with the higher quiz effect on effort for less proficient students, but suggests that they need concrete incentive structures, such as these mandatory quizzes, to foster their willingness to invest effort.

Finally, the interrelations between quiz and achievement emotions will be compared for more and less knowledgeable students. Table 9.14 starts with the comparison across course emotions. For course enjoyment, more proficient students tend to profit more from the impact of quiz 1 on course enjoyment and the subsequent reciprocal effect from course enjoyment to quiz 2, which are both higher (but not significantly so) compared to less proficient students. The coefficients of quiz 4 on course enjoyment are weakly significant in both groups and do not differ significantly from each other. These findings suggest that the enjoyment of the course is associated with the subject-specific knowledge level. Course enjoyment is likely more relevant for more proficient students because they are already more acquainted with the subject matter and can thus appreciate the course itself. By contrast, particularly at the beginning of the semester, less proficient students might lack the necessary expertise to immediately enjoy the course. Accordingly, the mean difference in course enjoyment between less and more proficient students decreases throughout the semester. Apart from the differing initial effects, there is no consistently differential pattern for course enjoyment. Regarding course hopelessness and similar to the design-specific models, the structural relations are mostly stable across both levels of expertise and no consistent or systematic deviations could be found between the groups, apart from slight differences in the temporal occurrence of the effects. Strikingly, even though both groups similarly profit from the quiz, the latent mean difference in course hopelessness still increases to the detriment of less knowledgeable students by the end of the semester.

Table 9.15 shows the structural relations of learning emotions across levels of expertise. For learning enjoyment, only the effect of the initial appraisal is higher in the group of more knowledgeable students than in the group of less knowledgeable students. This difference conforms to that of course emotions and might involve similar reasons (i.e., more knowledgeable students appreciate learning more easily on the grounds of their better expertise). Apart from that, all relations are insignificant in both

>.100

ρ of  >.100

.146

−.706c −.665b −.817c

−.031c

−.037c

>.100

Good

Bad

ρ of  >.100

−.014

−.015b

−.014b

Hc 4 → Q2

>.100

.015

>.100

−.591a

−.215

−.477b

Q2 → Hc 6

>.100

>.100

−.014

.005

−.007

Hc 6 → Q4

.094

−.278

−.878c

−.606c

Q4 → Hc 9

>.100

−.012 >.100

.414a

.006

−.251 .422

.405a

.434c

Q4 → Jc 9

−.001

Jc 6 → Q4

.020

Q2 → Jc 6

>.100

−.025c

−.015b

−.021c

Hc 9 → E

>.100

−.011a

−.011b

−.010b

Jc 9 → E

Notes. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3. ρ of  is the p-value for the H0 that both regression weights are equal in both groups.

>.100

Q1 → Hc 4

−.037c

Original effect

Hc 2 → Q1

Course Hopelessness

.001

.020c

.768c

.004

Bad

.018c

.472b

.008

Jc 4 → Q2

Q1 → Jc 4

Course Enjoyment

Jc 2 → Q1

Good

Original effect

Table 9.14 Expertise-Specific Structural Relations between Quiz and Course Emotions

Table 9.15 Expertise-Specific Structural Relations between Quiz and Learning Emotions

Learning Enjoyment     JL 3 → Q2   JL 3 → Q3   Q2 → JL 7   JL 7 → Q4   JL 7 → E
Original effect        .017c       .012b       .082        .004        −.008a
Good                   .027c       .011        .305        .010        −.010a
Bad                    .006        .011        −.035       −.006       −.003
ρ                      .040        >.100       >.100       >.100

Learning Hopelessness  HL 3 → Q2   HL 3 → Q3   Q2 → HL 7   HL 7 → Q4   HL 7 → E
Original effect        −.006       −.011a      −1.395c     −.003       −.014c
Good                   .004        −.007       −1.344c     .002        −.011b
Bad                    −.015b      −.010       −1.493c     −.009       −.010b
ρ                      .049        >.100       >.100       >.100       >.100

Notes. For explanations of the symbols, refer to sections 5.2.2 and 5.2.3. ρ is the p-value for the H0 that both regression weights are equal in both groups.

groups, conforming to the small magnitude of the original coefficients of the general model. Regarding learning hopelessness, only the impact of prior emotion on quiz 2 is significantly higher for less knowledgeable students. This, however, seems to be a singular finding because only the first coefficient is significant for this group. The latent means of learning emotions also decrease to the detriment of less knowledgeable students, suggesting that the quiz feedback does not suffice to compensate for the expertise-related average differences by the end of the semester, even though the effects are mostly similar for both groups.
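The group comparisons above rest on a p-value for the null hypothesis that a regression weight is equal in both groups. As a rough illustration only, the following sketch approximates such a test with a two-sided z-test that treats the two group estimates as independent. This is a simplification: the thesis reports Wald tests from the multiple group model, which also account for the covariance between estimates. The function name is hypothetical, and the standard errors are placeholders because they are not reported in this section, so the printed p-value is not meant to reproduce the tabulated one.

```python
import math


def wald_equal_weights(b_group1, se_group1, b_group2, se_group2):
    """Two-sided z-test of H0: b_group1 == b_group2, assuming independent estimates."""
    z = (b_group1 - b_group2) / math.sqrt(se_group1 ** 2 + se_group2 ** 2)
    # For a standard normal statistic, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Regression weights for JL 3 -> Q2 (good vs. bad prior knowledge) as listed in Table 9.15.
# The standard errors are PLACEHOLDERS chosen for illustration; they are not from the study.
z, p = wald_equal_weights(b_group1=0.027, se_group1=0.008,
                          b_group2=0.006, se_group2=0.008)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With the actual standard errors and the estimated covariance of the two weights, the same logic would give the one-degree-of-freedom Wald statistic as the square of z.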

9.4.5 Implications and Follow-up Questions from the Multiple Group Analyses

To sum up the key findings from the multiple group analyses per construct, male students and students from the FC seem to profit more from the initial effect of quiz on self-efficacy. In both cases, the initially higher effects bottom out in size at the level of the initially weaker coefficient by the end of the semester. The quiz impact on self-efficacy in the midterm does not vary significantly depending on proficiency, but it persists until the end of the semester only for more versed students. From the beginning of the semester up to the exam, reciprocal effects were


found in all groups, which only differed in their time of occurrence. For difficulty, no consistently significant pattern was found, while the negative relation between difficulty appraisal and exam score was robust across all groups. For interest and value, quiz impacts on subsequent appraisals were consistently in favor of male students, more proficient students, and students within the FC. The same pattern holds true for affect, except for the lack of consistent expertise-specific differences. Another pattern that emerged from the FC group was that expectancy- and value-related preconditions on subsequent quiz performance began to play a significant role only after the first half of the semester. Reciprocal effects of the effort construct are stable across all groups, except that female students and TC students profit more from the impact of the first quiz, while these effects bottom out by the end of the semester. More proficient students and students from the FC profit from the reciprocal relations between course enjoyment and quiz performance, whereas no consistent gender-specific patterns were found. The effect from quiz 2 on subsequent learning enjoyment is insignificant across groups, while prior learning enjoyment is more relevant for students from the FC and more proficient students. The reciprocal relations of course- and learning-related hopelessness with quiz performance are also similar across all groups, except for being more consistent for female students. Even though many regression coefficients measuring the quiz impact on subsequent appraisal are significantly weaker in certain groups, they are at least still significant or are significant at another measurement occasion. Moreover, the Wald tests are mostly insignificant, suggesting that the structural relations are fairly consistent across the groups.
In that regard, the only consistently insignificant relationships of quiz performance on subsequent appraisal throughout the complete semester concern value for female students, difficulty for male students, and value/interest for less proficient students.11 Strikingly, no quiz impact on subsequent expectancy, value, or achievement emotion consistently disappears depending on the course design, except for the quiz impact on subsequent course enjoyment, which is only significant in the FC. Hence, the starkest differences in quiz reception seem to stem from prior student characteristics. There is more variance concerning the significance of the relationships of prior appraisals on

11

There are other relationships of quiz on subsequent appraisal which are insignificant in one group. This concerns interest at t9 for female students; course hopelessness at t6 and t9 for male students; and self-efficacy at t9, interest at t9, course hopelessness at t9, and course enjoyment at t4 for less proficient students. These insignificant relationships with the respective appraisals were only occasional, i.e., other impacts throughout the semester remained significant so that the reciprocal relations in the cases mentioned remained intact.


subsequent quiz performance. Most of the group-related differences in these relationships can, however, be considered unsystematic; for example, the impact of self-efficacy on quiz 1 is significant for females, and the impact on quiz 2 is significant for males. Other coefficients, such as D5 → Q3 for female students, occurred only once, so that no systematic pattern can be inferred. The only systematic group-related deviations for appraisals on subsequent quiz performance were found for prior affect, which was only significant for male students, prior learning hopelessness (female students), prior self-efficacy (TC design), and prior interest (less proficient students). Considering the occasionally different magnitudes of the additive relationships, three follow-up questions arise that will be presented within the scope of secondary findings. The presentation follows the effect mechanism, starting with the effect of prior appraisals on quiz performance (1) and proceeding to the effect of quiz performance on subsequent appraisals (2 and 3). First, when considering the occasionally different magnitudes of the additive effect mechanisms across groups, the question arises whether the multiplicative effects of expectancy and value constructs on subsequent quiz performance and emotion that were found in section 8.4.4 are stronger in specific groups only. Second, the multiple group analyses revealed that some groups seemed to profit more consistently from the quiz impact for several motivational and emotional facets. For instance, male and more proficient students, as well as students in the FC, profit more from the self-efficacy-enhancing quiz impact at the midterm, whereas female students and students in the TC profit more from the quiz impact regarding subsequent effort at the midterm. Thus, the follow-up question arises whether one of the two course formats particularly benefits specific groups of learners according to their gender or prior knowledge regarding the quiz impact on subsequent appraisals.
Third, the additive CV model (see section 8.4.2) suggested that achievement emotions, in direct comparison to the EV facets, play a less important role regarding the feedback effect, as the regression weights of achievement emotions were more often insignificant or smaller in magnitude. The focus of the third question will therefore be whether achievement emotions are, after all, relevant in the whole feedback process or whether they can be neglected.

9.5 Secondary Findings on Group-specific Moderation Effects

9.5.1 Group-specific Multiplicative Expectancy-value Effects

Chronologically starting with the effect mechanisms arising from prior appraisals on subsequent quiz performance, section 8.4.4 revealed that there are occasional multiplicative effects of the EV constructs on subsequent quiz performance and achievement emotions. More concretely, latent interaction terms were generated for self-efficacy and interest (S×I) as well as self-efficacy and value (S×V). These interaction terms significantly impacted subsequent learning enjoyment at t3 and t7 as well as quiz 1 (S×V). The significant interaction effects of S×I on quiz 2 and quiz 4 performance were only meaningful under unattainably high self-efficacy manifestations.12 In order to analyze whether these multiplicative effects are robust or only hold true for specific groups, the models MCI and MCV from section 8.4.4 were submitted to another multiple group analysis according to design, gender, and prior knowledge.13 Since Mplus does not allow latent interaction terms in the traditional multiple group analysis, a multiple group mixture model was estimated in which the three grouping variables were used as KNOWNCLASS variables. In mixture models, the overall model is specified by the label %OVERALL% while the model to be estimated differently has the label %c#2%, representing group number 2, i.e., the second manifestation of the dichotomous grouping variables. The overall model in the subsequent analyses was set up in such a way that the measurement models were held equal in both groups. Moreover, weak measurement invariance across time and strong measurement invariance of intercepts were

12

In the context of the group-specific analyses, the interaction effects on the constructs that were insignificant in the overall interaction models (i.e., course-related enjoyment and hopelessness as well as learning-related hopelessness) were reconsidered to see whether significant interactions might exist in these groups. However, in all group comparisons, these interaction effects were either insignificant, only significant at single measurement occasions, only meaningful in unusually low or high value ranges of the moderator variable self-efficacy, or did not vary systematically across each of the two groups.

13

The recourse to models MCI and MCV includes controlling for all main effects, autoregressive and cross-lagged relationships, and modelling of interaction effects for self-efficacy and interest at two measurement occasions (t1 and t5). As this chapter focusses on secondary findings on the differential function of interaction effects, the regression coefficients (i.e., reciprocal, cross-lagged, autoregressive) and the model fit will be omitted. For a graphical representation of the models, refer to Figure 8.9 and Figure 8.10.


applied analogously to the prior analyses. Once again, self-efficacy is used as the moderator variable to see to what extent the main effects vary between below- and above-average self-efficacious students (1 SD below and above the average self-efficacy) in the different groups. In order to abbreviate the presentation of the secondary findings, the EV interaction effects on learning enjoyment and quiz performance will be tabulated in Table 9.16 instead of using Johnson-Neyman plots.14

14

The interaction effects of SV → Q and SI → Q were omitted in Table 9.16 as they yielded no consistently group-specific interaction pattern across the three levels of self-efficacy in any of the groups at any measurement occasion.

Table 9.16 Group-Specific Multiplicative Expectancy-Value Effect on Learning Enjoyment

Levels of            Design                    Gender              Prior Experience
Self-Efficacy        Traditional   Flipped     Female    Male      High      Low
                     ß             ß           ß         ß         ß         ß

SV1 → JL 3
S×V                  .057c         .119c       .068      .128c     .157c     .066
Low (−1 SD)          .023          .039        .051      .021      −.057     .091
Average              .081          .158c       .118b     .149b     .099a     .156c
High (+1 SD)         .138a         .277c       .186b     .277c     .256c     .222c

SV5 → JL 7
S×V                  .014          .083c       .101c     .018      .114c     −.007
Low (−1 SD)          .021          −.086       −.043     .017      −.112a    .045
Average              .035          −.002       .058      .035      .002      .038
High (+1 SD)         .048          .081        .158c     .052      .116a     .031

SI1 → JL 3
S×I                  .038          .076        .032      .126b     .099b     .015
Low (−1 SD)          .169a         .447c       .303c     .344c     .313c     .371c
Average              .206b         .523c       .334c     .471c     .412c     .386c
High (+1 SD)         .244b         .598c       .366c     .597c     .510c     .401c

SI5 → JL 7
S×I                  .068          .128c       .132c     .078a     .164c     .046
Low (−1 SD)          .055          −.036       .062      −.034     −.036     .012
Average              .123          .092        .194b     .044      .128      .058
High (+1 SD)         .190a         .220b       .326c     .121      .292c     .104

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.
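The rows "Low (−1 SD)", "Average", and "High (+1 SD)" in Table 9.16 can be read as simple slopes: the effect of the predictor at the average moderator level plus or minus the interaction coefficient. The following minimal sketch illustrates this reading for the flipped-course column of SV1 → JL 3. It assumes that self-efficacy is scaled so that one standard deviation equals one unit, which is an inference from the tabulated values rather than something stated in this section; the function and variable names are hypothetical.

```python
def simple_slope(b_at_average: float, b_interaction: float, moderator_sd_units: float) -> float:
    """Conditional effect of the predictor at a moderator value given in SD units."""
    return b_at_average + b_interaction * moderator_sd_units


# Flipped-course entries for SV1 -> JL 3 as listed in Table 9.16.
b_value_at_average = 0.158  # effect of value at average self-efficacy
b_sxv = 0.119               # S x V interaction coefficient

levels = {"low": -1.0, "average": 0.0, "high": 1.0}
slopes = {label: round(simple_slope(b_value_at_average, b_sxv, z), 3)
          for label, z in levels.items()}
print(slopes)
```

Under this assumption the computed low and high slopes agree with the tabulated .039 and .277; for other columns, agreement holds up to rounding in the third decimal.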

A comparison of the magnitude and significance reveals that the multiplicative effects of S×V and S×I on learning enjoyment and quiz are consistently only significant in the FC and for more proficient students. Only the interaction effect of S×I on JL 3 is insignificant in both designs, but still higher in the FC. No consistently systematic gender patterns seem to underlie the interactions on learning enjoyment.15 A further consideration of the three levels of self-efficacy in the different groups allows for better insight into the concrete effect mechanisms.16 In the TC, the effect of V1 → JL 3 only becomes significant for students with an above-average self-efficacy, while even average self-efficacious students profit from this effect in the FC. Moreover, the slope of the effect is higher in the FC, implying that increasing self-efficacy strengthens the enjoyment-enhancing effect of prior value more than in the TC. Regarding the impact of SI1 → JL 3, interest seems to be a more prominent enjoyment-enhancing factor than value in both designs because even below-average self-efficacious students profit significantly from the enjoyment-enhancing effect of interest. Despite the insignificance of the interaction effects for JL 3, a comparison between the unstandardized regression coefficients of both designs suggests

15

For instance, the significance of the interaction effect on learning enjoyment alternates between the male and female group (i.e., SV1 → JL 3 is significant for males whereas SV5 → JL 7 is significant for females).

16

All inconsistent effects, such as the gender-specific interaction effects on learning-related enjoyment, will not be further elaborated due to their unsystematic group-specific patterns.


that students from the FC profit more from the enjoyment-enhancing interest effect across all levels of self-efficacy. Concerning measurement occasion t7, the group-specific interaction effects remain consistent with t3, i.e., in favor of students in the FC. However, the impact of interest on learning enjoyment is only significant for above-average self-efficacious students in both designs, and the impact of value on enjoyment becomes insignificant across all three levels of self-efficacy in both designs. Even though the design-specific moderation effect of self-efficacy and value on learning enjoyment becomes minuscule by the end of the semester, the significant interaction effects for both occasions still confirm the consistency of the pattern. The differential functioning of the interaction on learning enjoyment based on prior knowledge is less consistent for V1 → JL 3 because the ß coefficients for high and low proficient students do not differ much from each other on each level of self-efficacy despite the significant interaction effect.17 The expertise-related interaction of self-efficacy and interest on subsequent learning enjoyment yields a clearer picture. Concerning SI1 → JL 3, the impact of interest on subsequent learning enjoyment increases more strongly with increasing levels of self-efficacy for higher proficient students than for lower proficient students. More concretely, above-average self-efficacious and simultaneously higher proficient students profit more from the enjoyment-enhancing interest effect than lower proficient above-average self-efficacious students. This difference seems intuitive as it can be assumed that higher proficient and simultaneously highly self-efficacious students can draw disproportionately more profit from the interest effect due to their actually better prior experiences (i.e., final math grades).
While the enjoyment-enhancing interest effect for JL 3 is significant for low and high proficient students across all levels of self-efficacy, I5 → JL 7 is only significant for high proficient students with a high self-efficacy. By the end of the semester, the interest-enjoyment relationship is thus only significant for high proficient students who also assess themselves to have a higher self-efficacy. In sum, these secondary findings only suggest tendencies toward group-specific EV interaction effects that still lack consistency in most groups, particularly when considering the ß coefficients at the different levels of self-efficacy. In the FC, the enjoyment-enhancing effect of interest and value already applied to students with a below-average and average self-efficacy, respectively, whereas only above-average students profited in the TC. It could be assumed that the intrinsic

17

The interaction effect is likely still more significant for the high proficient students because the ß coefficients of low and high self-efficacy (−.057 vs. .256) are considerably different from each other.


motivation is more readily conveyed and promoted in the FC from the beginning of the semester so that even students with only an average level of self-efficacy are more easily intrinsically motivated to enjoy the course. Moreover, increasing self-efficacy tends to be more fruitful in the FC because higher self-efficacy levels lead to a higher increase of the enjoyment-enhancing value and interest effects compared to the TC, at least at the beginning of the semester. Hence, self-efficacy might be a stronger catalyst in the FC for the relation between value and learning enjoyment. Regarding the level of proficiency, the results suggest that higher proficient students profit more from the enjoyment-enhancing effect of interest and value with increasing self-efficacy compared to low proficient students. Higher proficient students might profit more from the value appraisal when they are above-average self-efficacious because their self-appraisal is then more congruent with their actually good prior experience. At t7, these effects are even reserved for high proficient students with an above-average (presumably accurate) self-efficacy. The pattern, however, shows that additional measures to promote the self-efficacy of lower proficient students (e.g., adaptive feedback or more informative feedback) could further increase the impact of intrinsic motivation on achievement emotions.

9.5.2 Design-specific Quiz Effect Depending on Gender and Prior Knowledge

The findings from section 9.4 suggest that certain feedback effects are stronger in certain designs and groups of students according to gender and prior experience. Some relationships showed a similarly higher impact for two different groups, such as Q1 → S5, which was higher for male students and students in the FC. Moving one step forward in the effect mechanism chain from group-specific feedback-influencing factors to group-specific feedback corollaries, the question arises whether, for instance, male students profit more from the feedback effect in an FC compared to female students. This follow-up question was investigated by means of another multiple group analysis in which interaction terms were generated with the dichotomous grouping variables. To analyze whether specific groups profit more in a specific design, the manifest interaction terms quiz×gender and quiz×experience were defined,18 and the four known separate

18

The quiz score that just preceded the respective exogenous appraisal was factored into the interaction. For instance, value at t9 was regressed on the interaction of gender×quiz score 4.


structural models19 were submitted to another multiple group analysis with the grouping variable design (0 = traditional, 1 = flipped). The measurement and structural models were set up as in section 9.4, except that the subsequent appraisals were additionally regressed on either gender or prior experience to control for these main effects, as indicated in Table 9.17.

Table 9.17 Design-Specific Quiz Effects according to Gender and Proficiency Level

Design-specific quiz effect Q1 → S5 Q4 → S9

Q4 → D9

Q1 → V5

Q4 → A9 Q1 → E5

Low

ß

ß

−.098

.490

−.522

.504

1.330c

1.807c

−.238

1.755c

1.517c

−.001

.685b

.685b

−.220

.747b

.527

.420

.575a

.995c

−.341

.838c

.496

Traditional Traditional

−.225

−.018

Flipped

1.531c

−.685a

Traditional Traditional Traditional

−.018

−.243

−.286

−.046

.332

.846

.341

.054

.116

−.321

.182

−.139

.061

−.088

.226

−1.108b

.553

−.556

−.284

.258

−.026

.891 −.381 .405

.058

.948b

.301

.759

1.173c

.792

−1.300b

1.666c

.366

.157

.563

−.396

.572

.176 .216 .083

.457

Flipped

−.635

.870c

.236

−.486

.702a

Traditional

1.049

.018

1.067b

−.762

.845a

.680

.572

1.251b

−.992

1.441c

−.616

.987b

−.497

.420

−.077

.218

.689a

.109

.322

.431

Traditional Flipped

Q1 → A5

High

.588

Flipped Q4 → V9

Expert ×Quiz

.476

Flipped Q4 → I9

Male ß

Flipped

Flipped Q1 → I5

ß

Traditional

Flipped Q1 → D5

Gender×Quiz Female

1.604c .471

Traditional

1.064

.131

1.195c

−.341

1.049a

Flipped

−.293

1.435c

1.142a

−.446

1.535c

Traditional

−.916

.872b

.449

.708 1.089a

−.044

−.236

.618

.382

Flipped

.241

.014

.254

−.755

.452

−.303

Traditional

.101

1.228c

1.329c

−.905

1.021c

.116

Flipped

.296 −.097

1.141c .544

1.437c .447 (continued)

19

Analogous to these structural models, the cross-lagged, autoregressive paths were controlled for too. To abbreviate the presentation of the secondary findings, their regression coefficients will not be depicted in further detail.


Table 9.17 (continued) Design-specific quiz effect Q4 → E9

Traditional Flipped

Q1 → Jc 4

Traditional Flipped

Q2 → Jc 6

Traditional Flipped

Q4 → Jc 9

Traditional Flipped

Q1 → Hc 4 Traditional Flipped Q2 → Hc 6 Traditional Flipped Q4 → Hc 9 Traditional Flipped Q2 → JL 7

Gender×Quiz Female ß

High

Low

ß

ß

.671a

.784b

.319

.532

.851a

−.255

.768a

.513

.468

.428

.896a

.401

.598c

.045

.643

−.068

.347b

.200

.547

.420c

−.417

.002

.083 −1.433a

.319 1.365b

.501

−.414

.086

−.005

−.215

−.222

.208

−.277

−.069

.256

.082

.338

.032

.248

.280

−.114

−.575a

−.386

−.443

−.829a

.002

−.560a

−.712

−.967

−1.679c

−.028

.454

−.686

−.232

−.082

−.434

−.516a

−.182

−.006

−.242

−.121

−.063

−.185

−.733c

−.424a

−.502

−.504a

−.512

.912b −.162

Traditional Flipped

.312

Flipped

Expert ×Quiz

.113

.882a

Q2 → HL 7 Traditional

Male ß

.737 −1.125a

.796b

.285

−1.040c

−.128

−.366

−.528

−.530

.352

.169

.481

.309a −.038 −1.181a

−1.574c −1.602c

.688

.141

.381

−.832b

−.942

−.640

−1.196b −2.321c

.479

−1.569c

.461 −.558a

−.493 .521 −1.582c

−2.020c −1.541c

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.
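The triples in Table 9.17 (interaction term followed by two group-specific coefficients) appear to follow the usual dummy-coded moderation logic: the slope in the comparison group equals the slope in the reference group plus the interaction coefficient. The sketch below illustrates this with the flipped-course entries for Q1 → Jc 4. Two points are assumptions rather than statements from the source: that female students form the reference group (coded 0), and that the three values belong to the same table row, which is inferred from their arithmetic fit in the extracted table. The function name is hypothetical.

```python
from __future__ import annotations


def group_slopes(b_reference: float, b_interaction: float) -> tuple[float, float]:
    """Return (reference-group slope, comparison-group slope) for a dummy-coded moderator."""
    return b_reference, b_reference + b_interaction


# Values as they appear in Table 9.17; their assignment to one row is an inference.
female_slope, male_slope = group_slopes(b_reference=1.365, b_interaction=-1.433)
print(f"female: {female_slope:.3f}, male: {male_slope:.3f}")
```

Read this way, a significant negative interaction term indicates that the quiz effect on course enjoyment is clearly weaker for the comparison group than for the reference group.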

Even though only a few of the interaction effects are significant, suggesting that neither gender nor prior knowledge has a consistent moderating effect in either course design, detailed consideration of the unstandardized coefficients in each group per design allows for an evaluation of potentially group-specific motivational mechanisms. Starting with gender-related differences in expectancy appraisals, the self-efficacy-enhancing impact of quiz 1 (Q1 → S5) is stronger for male students in both designs. However, the effect is insignificant in the TC design while both males and females profit significantly in the FC, with a slight advantage for male students at both measurement occasions. Even though the interaction effect is not significant in either design, these findings partly confirm the pattern from the prior multiple group analyses, suggesting that both students in the FC and male students profit more from the feedback regarding their self-efficacy. Particularly at t9, the two-group comparison suggested that


the effect Q4 → S9 is similar across groups, while the interaction here suggests that only in the TC design do both genders profit equally from the quiz, while the FC design broadens the gender-related gap in feedback reception throughout the whole semester. Despite these differences, both genders profit significantly in both designs except for the beginning of the traditional semester. Regarding the impact of Q1 → D5, there is a highly significant interaction in the FC at both measurement occasions. The significant difference between the regression weights of the male and female groups stems from the fact that they have different algebraic signs, indicating once again the presence of different motivational mechanisms (see section 9.4.3). While female students at t5 might a priori have been more self-critical and found statistical tasks to be easier after they performed well at the quiz, male students seemed to be overconfident and revised their difficulty appraisal upwards after performing well at the quiz. Regarding the magnitude of the effect for male students in the FC at t5, it could be assumed that the corrective effect is stronger in the FC. The effects might be greater in the FC due to the stronger focus on action-oriented tasks in the lecture, tutorials, and off-campus phases throughout the semester. For t9, the algebraic sign of the regression weights in the FC switches between male and female students, even though they are not significant anymore. Hence, the different feedback effects regarding difficulty do not seem to follow a consistent gender-related pattern but at least contribute to adaptation processes one way or the other. The underlying different and alternating mechanisms (confirmative vs. corrective) might also be the reason behind the lack of differences in the average difficulty throughout the semester and the insignificance of the overall regression weights, which cancel each other out in specific groups.
Regarding the quiz-interest relationships, despite the insignificant interaction effects and partly insignificant regression weights in each group, it seems that male students profit more in the TC and female students in the FC. The gender-related pattern for the quiz-value relationship is more consistent in that, irrespective of the design, male students profit more from the value-enhancing effect of feedback. This conforms to the gender-specific multiple group results, in which the value effect differed strongly in favor of male students while the design-specific differences were more balanced (see sections 9.4.2 and 9.4.3). The impact of feedback on positive affect varies rather inconsistently across the groups. In the TC design, the positive impact of quiz on affect for male students is significant at the midterm, while female students profit more at the end of the traditional semester. In the FC, both genders profit equally, but only at the beginning of the semester.


The effort-enhancing feedback effect at t9 is quite consistent across both designs and genders. However, at t5, male students do not profit from this effect in the FC design while both genders profit equally in the TC design. This discrepancy might be the reason why the quiz effect on effort in the FC was significantly weaker in the prior multiple group analyses (see section 9.4.2), as it only applied to female students. In the FC, there is thus no relation between good quiz performance and effort for male students. The absent effect, however, could also indicate that adaptation processes take place in such a way that male students with a lower quiz score decide to invest more effort and vice versa, thus lowering the positive correlation. This need-based reaction might only apply to males in the FC, as their overconfident tendencies (see section 9.4.3) might be more easily levered out when they are induced to be more active from the beginning of the semester. The enjoyment-enhancing quiz effect is only significant for female students in the FC at t4 and t9, even though the difference is only significant at the first-mentioned occasion, as indicated by the significant interaction (−1.433). This conforms to the prior multiple group analyses, in which the impact at t4 was greater both for female students and in the flipped semester when considered separately. Hence, female students seem to respond better to the enjoyment-enhancing feedback effect in the FC while male students do not seem to profit in either design. Regarding course- and learning-related hopelessness, female students rather profit from the hopelessness-reducing impact of feedback in the TC while male students profit more in the FC (except for the traditional group at t4). The discrepancy is most considerable for the relationship of Q2 → HL 7 in the flipped semester. The differential functioning could be due to gender-specific affinities for the two designs.
For instance, male students could more easily deal with the more open structure of the FC and the uncertainties arising therefrom, while female students might feel more comfortable in the firmly structured TC design. Due to the lacking significance of some interaction effects and regression weights, this can, however, only be considered a vague pattern. Moving to expertise-specific moderating effects in both designs, no significant interaction effects were found for self-efficacy and difficulty. Analogous to the findings from section 9.4.4, higher proficient students profit longer (i.e., up to t9) from the self-efficacy-enhancing quiz impact irrespective of the design. Moreover, the already reported stronger effect of feedback on self-efficacy in the FC at t5 also prevails in these interaction models (see section 9.4.2). For difficulty, no systematic group- and design-specific pattern could be determined due to the insignificance and low magnitude of the regression weights and interaction terms. Regarding interest and value, high proficient students profit more consistently irrespective of the design (see section 9.4.4). The most considerable differences


occur in the FC design at t5, where high proficient students profit more from the interest and value effects of the feedback compared to low proficient students. The interaction effect is only significant for Q1 → I5 (−1.3). The higher proficient students still profit more from the quiz effect on interest in the FC design at t9, but the interaction effect is insignificant. This synergetic effect conforms to the separate analyses (see sections 9.4.2 and 9.4.4), where both high proficient students and students from the flipped semester profited more from the interest and value effects emerging from quiz feedback. It could be assumed that higher proficient students profit more from the quiz feedback in the FC because they might be better able to process the feedback in favorable ways concerning their intrinsic motivation when learning in a more self-regulated manner. By contrast, less proficient students might profit less because they feel left on their own and need further guidance to process the feedback efficiently. For all subsequent relationships (affect, effort, course- and learning-related emotion), even though the interaction effects were occasionally significant, no consistently systematic expertise-related pattern was found for the two designs, so that the moderating effect of proficiency seems negligible. This conforms to the robustness of the relationships reported for the expertise-related multiple group analyses for these constructs (see section 9.4.4). In sum, gender-specific moderation effects seem to be more prevalent in the different designs compared to expertise-related effects, even though most of the interaction effects were insignificant. When also considering the difference in the gender-related regression weights in both designs, male students tend to profit more consistently in the FC design regarding self-efficacy and course- and learning-related hopelessness.
It thus seems that processing quizzes in the flipped course design helps male students to better estimate whether they can cognitively understand and emotionally handle the statistics course. Female students, by contrast, profit more consistently in the FC regarding the quiz impact on constructs involving intrinsic motivation, i.e., interest, effort, and course enjoyment. Moreover, female students come off better in the TC design regarding course and learning hopelessness, suggesting that they might need a clear structure to better adapt their capacity to handle the course. Concerning expertise-related effects, high proficient students are more susceptible to the interest- and value-enhancing quiz effects in the FC, which suggests that intrinsic motivation is propelled more efficiently in the FC in association with a higher prior level of knowledge.

9.5 Secondary Findings on Group-specific Moderation Effects

9.5.3 The Moderating Effect of Achievement Emotions on the Expectancy-value Feedback Effects

After having investigated the moderating effect of gender and prior knowledge on the quiz feedback effect in different designs, we shift the lens to another potential moderator of the feedback effect, i.e., achievement emotions. When the achievement emotions and EV constructs were combined into the additive CV model (section 8.4.2), achievement emotions lost some of their explanatory power because some regression weights became insignificant and smaller in magnitude compared to the others (i.e., Q1 → Jc 4, Q2 → Hc 6, and Q2 → HL 9). Learning-related enjoyment in particular did not consistently interrelate with quiz feedback throughout the semester. The findings suggested that EV appraisals dominated the feedback reception process and that, in contrast to the CV theory, prior emotions more strongly predicted subsequent EV appraisals. From these findings, the question arises to what extent achievement emotions meaningfully support the feedback process. To further analyze these effect mechanisms, several further models were computed in which achievement emotions functioned as moderator variables for the relationship between quiz performance and subsequent EV appraisal. For these analyses, interaction terms between the emotional appraisal and the quiz score preceding the respective EV constructs were generated by means of “XWITH”. The EV constructs were then regressed on these two predictors, their interaction, and all other preceding constructs according to the prior models20. To reduce the computational demand of conducting the analyses with TYPE = random and more than three integration points, each emotional construct was analyzed separately, taking the additive CV model from section 8.4.2 as a basis. Three new structural models were defined, by means of which the interaction effects were investigated: (1) course hopelessness CV model, (2) course enjoyment CV model, and (3) learning-related CV model. Figure 9.1 exemplarily shows model (3)21.
The model shows that two interaction terms were generated for quiz score 1 × HL 3 and quiz score 4 × HL 7, named Q×H. The interaction of learning-related emotions at t7 and quiz score 4 was not generated because the slopes were insignificant and small. The interaction effects in the interaction models in general were only sporadically significant. However, consideration of the quiz

20 As in sections 9.5.1 and 9.5.2, the autoregressive and cross-lagged effects were controlled for, too, but the values are omitted for reasons of brevity.
21 The graphical path models for the other two course-related models can be found in Appendix 25 in the electronic supplementary material.


[Path diagram not reproduced. Nodes shown: S1, S5, S9; I1, I5, I9; V1, V5, V9; E1, E5, E9; Q1, Q2, Q3, Q4; HL3, HL7; JL3, JL7; interaction terms Q*H and Q*J; EX.]

Figure 9.1 Path Diagram for the Multiplicative Relations of Quiz and Learning Emotions on Expectancy-Value Appraisals. (Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.)

effects on condition of below- and above-average manifestations of the achievement emotions allows for a more thorough evaluation of their relevance in the feedback processes. Accordingly, Table 9.18 shows the quiz impact on subsequent EV appraisals at low, average, and high levels of prior course hopelessness. Even though none of the interaction effects on subsequent EV appraisal is significant, there is a recurring pattern suggesting that higher hopelessness levels diminish the quiz effect on subsequent EV appraisal. For instance, students with below-average hopelessness profit more from the self-efficacy-enhancing quiz effect (Q1 → S5), while the feedback effect is cancelled out when students' hopelessness is above average. The same pattern applies to the quiz effects on interest, value, and effort at t5 as well as self-efficacy and value at t9. Moreover, for students with above-average hopelessness, most of the quiz effects on subsequent appraisals were nullified as they became insignificant. Students with above-average hopelessness, for instance, do not profit significantly from the self-efficacy-enhancing quiz effect anymore. All quiz effects except for Q4 → E9 and Q4 → I9 become insignificant when students are more than averagely hopeless, emphasizing the negative deactivating impact of the achievement emotion. The pattern for Q4 → I9 is counterintuitive to the other ones because higher


Table 9.18 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Course Hopelessness

Levels of Course Hopelessness   Quiz 1 → S5 (Moderator: Hc 2)   Quiz 4 → S9 (Moderator: Hc 6)
                                Estimate    p                   Estimate    p
Interaction Term                −.479       .218                −.099       .588
Low (−1 SD)                     1.146b      .011                .817c       .000
Average                         .667c       .001                .718c       .000
High (+1 SD)                    .189        .662                .619a       .056

Levels of Course Hopelessness   Quiz 1 → I5 (Moderator: Hc 2)   Quiz 4 → I9 (Moderator: Hc 6)
                                Estimate    p                   Estimate    p
Interaction Term                −.613       .143                .125        .503
Low (−1 SD)                     1.273c      .007                .491b       .033
Average                         .660c       .004                .616c       .003
High (+1 SD)                    .047        .923                .741b       .020

Levels of Course Hopelessness   Quiz 1 → V5 (Moderator: Hc 2)   Quiz 4 → V9 (Moderator: Hc 6)
                                Estimate    p                   Estimate    p
Interaction Term                −.214       .574                −.164       .448
Low (−1 SD)                     .744        .127                .476a       .067
Average                         .530b       .045                .312        .183
High (+1 SD)                    .316        .471                .148        .688

Levels of Course Hopelessness   Quiz 1 → E5 (Moderator: Hc 2)   Quiz 4 → E9 (Moderator: Hc 6)
                                Estimate    p                   Estimate    p
Interaction Term                −.444       .246                .349        .100
Low (−1 SD)                     1.267c      .008                .412a       .095
Average                         .823c       .001                .761c       .001
High (+1 SD)                    .379        .368                1.110c      .002

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.


hopelessness students seem to profit more from the quiz effect. However, the interaction effect is very small and insignificant, so that these variations only tentatively suggest that the relation of quiz 4 to subsequent interest depends on the hopelessness level. For the quiz-effort relationship, the pattern differs between the two measurement occasions. A low or average course hopelessness level strengthens the relationship of Q1 → E5, suggesting that students with a higher quiz score more resolutely plan to invest more effort. For students with an above-average hopelessness level, there is no significant quiz-effort relation anymore, which implies that higher hopelessness undermines performance-based adaptation processes. By contrast, the pattern is reversed for Q4 → E9, so that students with above-average course hopelessness profit more from the relationship between higher quiz performance and subsequent effort. Hence, students with above-average course hopelessness seem to be more motivated with increasing quiz scores. Despite the differences in magnitude, the regression weights are, however, significant at each of the three hopelessness levels. The differential pattern for the two measurement occasions suggests that feedback might be effective under different levels of hopelessness, or rather, that above-average hopelessness might also involve a stimulating, challenging facet as long as it is accompanied by positive performance experiences. Table 9.19 shows the interaction terms and the regression weights under different levels of learning-related hopelessness at t3. Learning-related enjoyment will be omitted because neither the slopes nor the manifestations in the three different groupings were meaningful.

Table 9.19 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Learning Hopelessness

Learning Hopelessness Level     Quiz 1 → S5 (Moderator: HL 3)   Quiz 1 → I5 (Moderator: HL 3)
                                Estimate    p                   Estimate    p
Interaction Term                −.427c      .006                −.220       .179
Low (−1 SD)                     .923c       .000                .541a       .075
Average                         .496c       .005                .322        .128
High (+1 SD)                    .068        .747                .102        .648

Learning Hopelessness Level     Quiz 1 → E5 (Moderator: HL 3)   Quiz 1 → V5 (Moderator: HL 3)
                                Estimate    p                   Estimate    p
Interaction Term                −.252       .126                −.258       .125
Low (−1 SD)                     1.077c      .000                .534a       .070
Average                         .825c       .000                .276        .208
High (+1 SD)                    .573c       .018                .018        .945

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.
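The conditional quiz effects in Tables 9.18 and 9.19 follow the usual simple-slope logic of a moderated regression: the effect at a given moderator level equals the effect at the average level plus the interaction term multiplied by the moderator's distance from its mean. The following Python sketch illustrates this arithmetic for the Quiz 1 → S5 path. It is an illustration only and not the Mplus syntax used for the analyses; the function name simple_slopes is hypothetical, and the sketch assumes that the moderator is scaled so that ±1 SD corresponds to ±1 unit, which the reported values suggest but the text does not state.

```python
def simple_slopes(average_slope, interaction, sd_units=(-1.0, 0.0, 1.0)):
    """Return the conditional quiz effect at each moderator level.

    Assumption: the moderator is standardized, so a distance of one
    standard deviation from the mean equals one unit.
    """
    return {z: round(average_slope + interaction * z, 3) for z in sd_units}


# Table 9.18, Quiz 1 -> S5, moderator: course hopelessness (Hc 2)
course = simple_slopes(average_slope=0.667, interaction=-0.479)

# Table 9.19, Quiz 1 -> S5, moderator: learning hopelessness (HL 3)
learning = simple_slopes(average_slope=0.496, interaction=-0.427)

for label, slopes in (("course hopelessness", course),
                      ("learning hopelessness", learning)):
    print(label, slopes)
```

Under these assumptions, the computed low and high values agree with the tabulated estimates to within rounding (a difference of at most .001).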

The pattern mostly conforms to that of course hopelessness, i.e., an above-average level of frustration renders the feedback effect on subsequent appraisals void. Lower levels of learning hopelessness are thus more beneficial for promoting feedback effects. Beyond that, below-average hopelessness might activate feedback effects that were insignificant under the average, ceteris paribus conditions of the CV model, such as Q1 → V5; the multiplicative perspective shows that this impact could be instigated for individuals with below-average hopelessness. Regarding effort, the effect mechanism is comparable to that with the moderator Hc 2. The persistent significance of Q1 → E5 even for students with above-average hopelessness again suggests that a certain degree of frustration might still entail a motivating force. In all, however, only the interaction effect for Q1 × HL 3 is significant, so that the other patterns can only be generalized with caution despite their consistency across the three levels of hopelessness. Moving from the negative deactivating to the positive activating achievement emotion of enjoyment, Table 9.20 illustrates its mode of action across the three known levels. Even though only one interaction effect was significant (Jc 6×Q4 → V9), the overall pattern suggests that higher levels of course enjoyment fuel the feedback effects on subsequent EV appraisals. For instance, the higher the prior enjoyment level, the higher the self-efficacy-enhancing effect of the quiz feedback. The same applies to most other EV appraisals, except for I9 and E9, which decrease with increasing enjoyment level. Since the feedback effects for these constructs are, however, significant at all three levels of prior course enjoyment, it could be assumed that they are rather independent of the prior emotional status.
Moreover, the feedback effects at a below-average enjoyment level are mostly insignificant and only become significant starting from the average enjoyment level, suggesting that at least an average level of enjoyment is necessary to profit from the feedback effects. Regarding effort, the differential pattern of effect mechanisms that was found for


Table 9.20 Quiz Effect on Expectancy-Value Appraisals at Varying Levels of Course Enjoyment

Course Enjoyment Level          Quiz 1 → S5 (Moderator: Jc 2)   Quiz 4 → S9 (Moderator: Jc 6)
                                Estimate    p                   Estimate    p
Interaction Term                .231        .360                .080        .480
Low (−1 SD)                     .452        .134                .698c       .001
Average                         .683c       .001                .779c       .000
High (+1 SD)                    .914c       .010                .859c       .000

Course Enjoyment Level          Quiz 1 → I5 (Moderator: Jc 2)   Quiz 4 → I9 (Moderator: Jc 6)
                                Estimate    p                   Estimate    p
Interaction Term                .171        .561                −.090       .465
Low (−1 SD)                     .445        .237                .648c       .006
Average                         .616c       .008                .558c       .002
High (+1 SD)                    .787b       .034                .469c       .022

Course Enjoyment Level          Quiz 1 → E5 (Moderator: Jc 2)   Quiz 4 → E9 (Moderator: Jc 6)
                                Estimate    p                   Estimate    p
Interaction Term                .214        .451                −.167       .249
Low (−1 SD)                     .574        .104                .800c       .002
Average                         .787c       .001                .633c       .002
High (+1 SD)                    1.001c      .009                .465a       .057

Course Enjoyment Level          Quiz 1 → V5 (Moderator: Jc 2)   Quiz 4 → V9 (Moderator: Jc 6)
                                Estimate    p                   Estimate    p
Interaction Term                .072        .798                .331b       .036
Low (−1 SD)                     .498        .187                .020        .939
Average                         .570b       .031                .351        .114
High (+1 SD)                    .641        .104                .681b       .016

Note. For explanations on the symbolisms, refer to sections 5.2.2 and 5.2.3.


hopelessness reoccurs. At t5, the positive interaction for the feedback impact on effort suggests that students with a high enjoyment level and higher quiz performance still plan to invest more effort subsequently. This suggests that effort might be rather positively connoted (rather than “exertion” in the narrower sense). At the end of the semester, the relationship is reversed, which could indicate either that learners reduce their persistence because they become lethargic in a state of high enjoyment, or that learning feels more effortless in a positive state of enjoyment (Harackiewicz, Smith, et al., 2016, p. 221). To shed further light on this mixed pattern, effort could be assessed differently (e.g., in the form of effort cost or persistence). In all, most interaction effects were insignificant, suggesting that the differences in the slopes between below- and above-average groups are not significant22. However, the different levels of prior emotions revealed a consistent pattern in that reducing course- and learning-related hopelessness and increasing course enjoyment promote most of the feedback effects on subsequent EV appraisals. Hence, didactical measures could be used to support students with unfavorable initial emotional levels, so that they would also profit more from the feedback effects. Learning-related enjoyment was omitted in this chapter since the small and insignificant interaction effects did not contribute meaningfully to the quiz effect. This conforms to the lack of significant interrelations in the additive CV models (section 8.4.2), so that the construct seems to be of less relevance compared to its course counterpart.

22 Group-specific moderation effects for the interactions have also been considered, but were found to be rather consistent among groups or to vary unsystematically.

10 Discussion and Conclusion

10.1 Synoptic Evaluation of the Hypotheses

10.1.1 Do Formative Achievement as well as Achievement Motivation and Emotions Predict each other throughout the Semester? (RQ1)

The present study emphasized the importance of investigating learning as a dynamic process that continuously adapts to instructional features of the learning environment. More concretely, the research contributed to the understanding of AME appraisals by investigating their dynamics in the course of one semester and relating their variance to repeated formative assessments and a final summative assessment. The findings have shown that unsupervised quizzes along with standardized electronic feedback contribute to an adaptation of students’ AME appraisals. Regarding RQ1 and the hypotheses H1-H7, the major findings of this study were that self-efficacy (H1), course enjoyment (H6a), and course hopelessness (H7a) were most consistently reciprocally related with formative achievement over time when considering the group-unspecific models. This finding highlights the relevance of competence beliefs and emotional experience in learning statistics. The reciprocal relation was slightly less consistent for affect (H4), effort (H5), and learning-related hopelessness (H7b), but still mostly continuous. By contrast, interest and value were the least reciprocally related with formative achievement (H3). For interest and value, the reciprocal relation was limited to the beginning of the semester, whereupon only formative feedback significantly predicted the two appraisals, but not vice versa. This suggests that the skill development model applies to these two constructs rather than the skill enhancement

© The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1_10


model (Burns et al., 2020) because achievement is a stronger driving force for subsequent appraisals than the reverse. A comparison of the reciprocal relations of self-efficacy, interest and value with formative achievement corroborates findings from Wigfield & Cambria (2010, p. 6), suggesting that ability-related beliefs are more predictive of achievement than value-related beliefs. At least, these two value beliefs were responsive to quiz feedback whereas learning-related enjoyment (H6b) and difficulty (H2) hardly ever predicted formative achievement or were predicted by formative achievement. Standardized beta coefficients suggest that in most cases, and particularly at the beginning of the semester, the effects of formative achievement on subsequent appraisal were twice as high as the effect of appraisal on subsequent formative achievement. This is similar to the meta-analytic findings of Talsma et al. (2018) regarding reciprocal linkages of self-efficacy with achievement and indicates that prior performance tends to be a better predictor of subsequent appraisals than vice versa. In all, statistics-related self-efficacy, affect, effort, course enjoyment and hopelessness are meaningfully interrelated with formative achievement throughout the semester and in accordance with the hypotheses. Difficulty may be unrelated to formative achievement due to its reference to more stable perceptions about statistics and will be discussed further below. The less consistent interrelations between interest, value, learning enjoyment, and formative achievement could stem from the feedback characteristics and will be discussed in more detail in section 10.2.

10.1.2 Do Achievement Motivation and Emotions Relate to Summative Achievement?

AME appraisals only inconsistently related to final exam grade. Self-efficacy related positively to final exam grade, but the magnitude of the regression coefficient was rather low in all models. Regarding the group-specific models, the efficacy-performance relationship was only significant in the FC and for male and lower proficient students. Difficulty is more strongly related to final exam grade compared to self-efficacy, and significantly so across all groups. The stronger interrelation of difficulty compared to self-efficacy could be due to the fact that students become more acquainted with the statistics-related tasks throughout the semester, so that the relevance of the perceived self-belief makes way for the more task-specific appraisal of difficulty based on prior experiences with the exercises. The finding is similar to Kiekkas et al. (2015), who argued that examinations developed according to the content of the course are better suited to assess


students’ appraisals throughout the course. Interest and value appraisals did not relate significantly to final exam performance. This corroborates other empirical studies that documented that expectancies for success were more predictive of achievement while value appraisals were more predictive of achievement-related choices and engagement (Guo et al., 2017; Trautwein et al., 2012; Wigfield & Eccles, 2020). Effort interrelated somewhat consistently with formative achievement, but summative achievement was not dependent on prior effort, which reflects the findings from Tempelaar and van der Loeff (2011), who suggested that quiz scores allow for more focused preparation and processing than a final exam under controlled and timed conditions. Tempelaar, van der Loeff, et al. (2007) further argue that effort is more relevant for quiz processing, as quizzes are more related to effort-based surface learning (i.e., a good score can be achieved through due diligence) while the exam is more of a cognitive performance indicator (appropriate thinking under time pressure). Regarding achievement emotions, hopelessness consistently related to exam performance, except for the gender-specific consideration. Female students seem to let themselves get more carried away by out-of-class frustration when they prepare for the final exam. The enjoyment-performance relationship deserves further scrutiny: Counterintuitively, both course- and learning-related enjoyment are negatively related to final exam score in most of the considered groups. The negative relationship was indeed theorized before. Several researchers argued that positive emotions could render individuals lethargic by conveying that all is going well. This illusion then is assumed to impact metacognitive judgments and to trigger untargeted, task-irrelevant thinking, bringing goal pursuit to a halt (Fong et al., 2018, p. 239; Pekrun & Linnenbrink-Garcia, 2012, p. 270; Wortha et al., 2019).
Another reason for the absence of more salient relationships with summative achievement (i.e., for interest, value, effort, and enjoyment) could be country-specific effects. In a review of literature, Emmioglu and Capa-Aydin (2012, p. 99) found that the effect sizes of statistics EV appraisals for the U.S. were double in size compared to non-U.S. countries. The preponderance of private institutions and smaller courses surveyed in the U.S. compared to public institutions in Europe, for instance, was assumed to lead to more pronounced attitudes on the part of the U.S. students. Frenzel, Thrash et al. (2007) also found the relations between achievement emotions and achievement in German studies. Finally, the exam score and quiz scores are tailored to the specific statistics course and should not be considered a validly generalizable yardstick for statistical reasoning (González et al., 2016, p. 220). However, as elaborated in section 2.1.4, there is as yet no easily transferable, sufficiently broad, standardized instrument to make statistical reasoning comparable across course contexts. This is why the present study


resorted to quiz and exam scores that were specific to the statistics course to ensure student buy-in and sufficient alignment of the cognitive measures with the course learning goals (Pekrun et al., 2014).

10.1.3 Do Feedback-Related Processes Vary according to Gender, Proficiency, and Course Design? (RQ2)

After the complete analyses, the generic reciprocal effects were considered in different groups. As regards gender, fortunately, quiz processing does not seem to be largely subject to sex bias. Gender-related differences in the different component models occurred occasionally and suggest that, as vaguely assumed, male students profit slightly more from feedback with regard to EV appraisals. Regarding self-efficacy and affect, male students profited more from the first quiz feedback while the effect was still significant for female students, too. The gender difference was, however, insignificant by the end of the semester. The feedback effect on subsequent interest was longer-lasting for male students, while only male students profited consistently from the value-enhancing feedback effect at t5 and t9. As assumed, female students were more susceptible to the effort-enhancing feedback effect; both genders, however, reacted significantly, and the differences disappeared at the end of the semester. Female students responded more consistently with regard to their emotional appraisals; the reciprocal linkages between course- and learning-related hopelessness and course enjoyment on the one hand, and formative achievement on the other, were more consistent compared to male students. Emotional adaptation to feedback was, however, also meaningful for male students. In sum, it can be assumed that the gender differences occurring only at the beginning of the semester (e.g., self-efficacy, affect, effort) are likely attributable to statistics-related, stereotypical experiences outside the course leading to a sex-differentiated approach to unfamiliar subject matter with which the disadvantaged group has little experience as yet (Randhawa et al., 1993, p. 46; Schunk & Lilly, 1984, p. 204).
This conforms to an earlier study of Schunk and Lilly (1984), who found that initial gender differences in self-efficacy due to beginning insecurities of female students diminished in the course of further feedback iterations. As Pajares puts it, the psychological mechanisms seem to be a function of the stereotypic beliefs an individual holds about gender roles rather than of gender itself, thus rendering stereotypes a self-fulfilling prophecy (2002, p. 119). The diminishing differences between the relationships suggest that, in accordance with Mason and Bruning (2001), computer-mediated feedback can contribute to


students appraising statistics based on actual task performance rather than gender stereotypes. Thus, feedback potentially contributes to gender-related equity in instruction. Regarding prior knowledge, no differences were found in the relations of difficulty, affect, and effort with formative feedback. The self-efficacy-enhancing feedback effect was similar for low and high proficient students, but lasted until the end of the term only for high proficient students. A more pronounced difference was found for value appraisals; only high proficient students profited significantly from the interest- and value-enhancing quiz effect. Only once was the interest-enhancing quiz effect significant at the 10% level for low proficient students. Only high proficient students were responsive to the enjoyment-enhancing feedback effect at the beginning of the semester while both groups profited weakly at the end of the semester. There were no consistently different patterns for course- and learning-related hopelessness. The expertise-related differences, in all, provide a mixed picture. On the one hand, differences in feedback effects were mostly to the disadvantage of low proficient students and mostly concerned motivational constructs (particularly regarding self-efficacy, interest, and value). On the other hand, the pattern of feedback effects for the affective-emotional constructs was fairly consistent or varied unsystematically (i.e., affect, enjoyment, hopelessness). Thus, it seems that regardless of prior proficiency, students react emotionally in a similar way according to the experiences gained within the statistics course (Zimmerman & Moylan, 2011, p. 308). However, it has to be borne in mind that expertise-specific mean differences become more pronounced throughout the semester to the disadvantage of low proficient students on all investigated constructs.
Hence, prior mathematics experiences still loom large in students’ AME appraisals when attending a first statistics course (Gal & Ginsburg, 1994). This is despite the fact that the target statistics course is situated in a not heavily math-oriented economics degree course. Instructors should therefore take account of students’ prior proficiency and inclinations (Guo et al., 2017, p. 81). Moreover, when considering more immediate prior formative achievement in the context of the statistics course, it should be noted that students with a lower quiz score profited less from the beneficial linear effects on subsequent AME appraisals. These effects will be discussed in more depth in section 10.1.5. Finally, design-specific differences suggest that only students from the FC profit from the efficacy-enhancing quiz 1 effect while there were no design-specific differences for self-efficacy at the end of the semester. The impact of formative achievement on subsequent affect was higher in the FC at the beginning of the semester, but still significant for both designs, while both effects were


insignificant at the end of the semester. Another interesting finding is that prior affect was consistently more strongly positively related to formative achievement in the TC compared to the FC (except for Q4). This corroborates findings from Cassady et al. (2001) as well as Núñez-Peña et al. (2015), indicating that anxiety on the part of students could stem from a lack of frequent preparation. The assumption that the FC motivates early engagement with the content, whereby students build up mastery experience, may have contributed to an attenuation of the (positive or negative) predispositions with which students approach quizzes. The value- and interest-enhancing effects were stronger in the FC than in the TC. While the value-enhancing effect was still significant in both designs, only flipped students profited from the interest-enhancing effect throughout the semester. In a similar vein, enjoyment, as a quite similar construct related to intrinsic motivation, was only significant in the FC in relation to formative achievement. When considering cognitive evaluation theory, the finding corroborates that, although informationally provisioned feedback can be expected to generally enhance intrinsic motivation, a less controlling environment, like the FC, reinforces this effect (Deci et al., 2001, p. 9). Hence, while many other intervention studies only had limited success in fostering achievement emotions (Pekrun, 2006, p. 337), the present study suggests that the FC could be a first step in creating an emotionally sound environment. Course hopelessness did not vary systematically across designs, while flipped students reacted more strongly to feedback regarding their learning-related hopelessness. The reciprocal linkage of learning hopelessness was more consistent in the TC. Interestingly, the only feedback effect that was more pronounced in the TC was the effort-enhancing impact of quiz 1.
The difference could stem from the fact that students, particularly at the beginning of the traditional semester, were not sufficiently encouraged to be active, so that they felt that their effort was boosted after the first quiz, which was the first time they were explicitly required to become active. By contrast, students in the flipped semester may have had a greater baseline effort at the beginning of the semester due to preparing for the f2f sessions and thus found the quiz not considerably more constraining. By the end of the semester, the difference between both effort effects disappeared. When considering the considerable heterogeneity of findings on FC efficacy related to AME appraisals and performance (Ranellucci et al., 2021), the results of the present study provide a fairly consistent picture: formative feedback significantly contributes to the enhancement of most AME appraisals in the FC, compared to the TC. Unlike most other studies, which compared course setups in general, this study thus identified quizzes as a critical feature of the FC to promote students’ motivational and emotional well-being.


Taking all three group comparisons into consideration, the only relations that were systematically disadvantaged were interest and value; the interrelations were weaker for female and low proficient students and students of the TC. Potential measures to foster interest and value will therefore be discussed in section 10.2. Apart from these systematic group-specific effects, the general consistency of effects across groups is in accordance with the CV proposition of relative universality of psychological functions across individuals (Frenzel, Thrash, et al., 2007; Pekrun, 2006, p. 324). The assumption predicts that, despite the prevalence of mean differences of AME appraisals across gender, proficiency, or learning environments, the functional mechanisms in relation to antecedents (i.e., feedback) and outcomes are equal on average and in general (Loderer et al., 2020).

10.1.4 Do Expectancy-Value Appraisals Synergistically Predict Formative Achievement and Achievement Emotions?

The data provide tentative indications of multiplicative EV effects, even though these were mostly weak and not consistent throughout the semester. Multiplicative EV appraisals related most significantly and consistently to learning-related enjoyment. Students with a higher level of self-efficacy profited more from the enjoyment-enhancing effects of interest and value. Since learning-related enjoyment was found to be less meaningful in the reciprocal relations with formative achievement, it could be assumed that higher expectancy and higher value appraisals are both required for enjoyment to play a more prominent role in the feedback process. Moreover, these interaction effects were more pronounced in the FC and for high proficient students. In the TC, only students with above-average self-efficacy profited from the enjoyment-enhancing interest and value effects, while in the FC, students with lower self-efficacy already profited. This finding suggests that lower perceived controllability is less detrimental to enjoyment enhancement in the FC than in the TC. By the end of the semester, only higher proficient students profited from the enjoyment-enhancing interest and value effects. The higher relevance of the interaction between EV and enjoyment in the FC likely stems from the greater need satisfaction in the context of self-determination and the greater flow experience. In a state of aroused interest, it can be expected that cognitive appraisals, perceived value, and affective reactions intertwine more strongly to promote well-being during learning (Harackiewicz, Smith, et al., 2016, p. 221).

When considering moderation effects in the different designs, male students in the FC profited more from the feedback effects on self-efficacy (but not significantly) and on learning-related hopelessness (significantly) compared to female students in the FC. Female students profited more from the feedback effects on subsequent interest (not significantly), effort (not significantly), and course enjoyment (significantly at t4) in the FC compared to male students in the FC. This suggests that female students benefit more from the FC regarding intrinsic factors, whereas male students profit more from the FC regarding their expectancies for success. High proficient students were more susceptible to interest- and value-enhancing feedback effects in the FC compared to low proficient students in the FC. While it seems that group-specific differences between the effects were more prevalent in the FC, it has to be considered that many of these relations were absent or attenuated in the TC for both groups, so that the FC should not be seen as systematically disadvantageous for specific groups. Finally, the quiz effects on students’ EV appraisals were shown to be higher for students with above-average course enjoyment and with below-average course- and learning-related hopelessness. This finding suggests that feedback is better accepted by students with more positive emotions than by students with more negative emotions. Even though the interaction effects often lack significance and consistency, they indicate that instructional measures or whole-class interventions should target AME appraisals equally rather than focus on a more limited number of constructs (Lauermann et al., 2017, p. 1557). For instance, knowing that achievement emotions can further increase feedback effects, an intervention focusing on self-efficacy alone might not unfold its full potential for students with low achievement emotions.
The present study accordingly suggests that implementing feedback in a conventional lecture may still produce suboptimal results (as shown by the partly attenuated and absent effects on self-efficacy), while the FC, which offers further affective benefits in fostering interest, value, and enjoyment, might also have contributed to making students more responsive to feedback, thus strengthening the other motivational effect mechanisms (Parr et al., 2019, p. 653).
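
One way to read the multiplicative expectancy-value effects discussed in this section is as an interaction term in a linear model, in which the slope of value on enjoyment depends on the level of self-efficacy. The following sketch is only an illustration of that logic: the function name `simple_slope` and all coefficients are hypothetical and are not estimates from the present study.

```python
def simple_slope(b_value: float, b_interaction: float, self_efficacy: float) -> float:
    """Conditional slope of value on enjoyment at a given self-efficacy level.

    Assumed illustrative model (self-efficacy SE and value V are centered):
        enjoyment = b0 + b_se * SE + b_value * V + b_interaction * SE * V
    The effect of V at a fixed SE is therefore b_value + b_interaction * SE.
    """
    return b_value + b_interaction * self_efficacy


# Hypothetical standardized coefficients, NOT taken from the study.
B_VALUE = 0.20
B_INTERACTION = 0.10

for label, se in [("low (-1 SD)", -1.0), ("mean", 0.0), ("high (+1 SD)", 1.0)]:
    slope = simple_slope(B_VALUE, B_INTERACTION, se)
    print(f"self-efficacy {label}: value slope = {slope:.2f}")
```

With a positive interaction coefficient, the value slope grows with self-efficacy, which corresponds to the pattern that more self-efficacious students profited more from the enjoyment-enhancing interest and value effects.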

10.1.5 Matthew Effects and Decreasing Salience of Feedback Effects

Finally, two consistent patterns worthy of discussion emerged from the data. First, the linear feedback effects suggest that students with a higher prior quiz score profit more from the beneficial effect of the feedback than students with a lower score. This conforms to the state of research elaborated in sections 3.2 and 3.3 and suggests the existence of Matthew effects, according to which students who perform well in the quizzes receive positive feedback, which promotes even greater enhancement of motivation and emotion, which in turn positively relates to subsequent performance, and so on (Marshman et al., 2018; Razzaq et al., 2020, p. 263). This pattern suggests that students achieving a low quiz score persist in an entity-theorist approach (Schunk, 1991), whereby the worse the performance, the more confirmation individuals receive for holding on to their negative self-beliefs, so that they react unfavorably in terms of their motivation. By contrast, more positive feedback seems to trigger incremental perspectives in which ability, no matter how high, is seen as an opportunity to set even higher goals for further competence enhancement and mastery (Locke & Latham, 2002, p. 708). Pekrun (2007, p. 590) rightly refers to these patterns as virtuous or vicious cycles for high and low achievers, respectively. An individual who is trapped in a cycle of accumulating failure is at risk of experiencing learned helplessness, whereby entity theorists adjust their achievement goals downward (Fong et al., 2018, p. 239; Peixoto et al., 2017, p. 389; Zingoni & Byron, 2017, p. 53). These negative dispositions will then negatively relate to subsequent performance and thus become self-fulfilling (Peterson et al., 2015, p. 93). This also suggests the existence of a self-consistency pattern, whereby students with low self-efficacy, for instance, refrain from embracing further opportunities for knowledge enhancement (Swann et al., 1987). This stands in contrast to the control-theoretical pattern underlying Carver and Scheier’s feedback model (2000; see section 3.1.1), which assumed that individuals with low prior performance would be more motivated and persistent in reducing such discrepancies (Richard et al., 2006, p. 69). Practical measures to avoid these tendencies are discussed in section 10.2.
The second recurring pattern was that, for all considered motivational and emotional constructs, the interrelations with formative achievement faded to a certain degree by the end of the semester. Theoretical and empirical findings offer several explanations for this. Swann et al. (1987, p. 887) argue that, with increasing experience of the working memory in a subject domain, perturbations from the affective system are expected to make way for the more reflective cognitive system (see also Autin & Croizet, 2012, p. 616). Cassady and Gridley (2005) go on to argue that the adoption of more useful study strategies after practice tests may lead to a reduction in the effects of test perception. The fading affective reactions to feedback in this study offer some support for these notions, as they suggest that between-person differences in the feedback-appraisal interrelations consolidate throughout the semester while prior performance in the quizzes (i.e., practice effects) remains relevant (Gist & Mitchell, 1987, p. 474; Yeo & Neal, 2006, p. 1091). Starkey-Perret et al. (2018, p. 460) found a similar attenuation effect for achievement emotions in middle school students and argued that the decrease in emotional reactions might be due to habituation over time. This corroborates the assumption that emotional appraisals can become routinized and non-reflective over time. This implies that situational perception and emotional appraisal become closely knit, so that students are not necessarily consciously aware of a concrete emotional state while learning (Pekrun & Linnenbrink-Garcia, 2012; Pekrun & Stephens, 2010). Nichols and Dawson (2012, p. 468) ascribed this attenuation to a developmental trend of disillusionment in view of approaching test-related pressures, such as summative examinations at the end of the semester, which lead to a generally more neutral mood. Regarding the stabilization of intrinsic value, Harackiewicz, Smith, et al. (2016, p. 221) assume that, while interest and value beliefs deepen across the initial phases of knowledge acquisition in a specific domain, they can also go dormant when they are not increasingly stimulated by means of external support. From the solidifying patterns of AME appraisals, it can be concluded that feedback should be reinforced in terms of quantity and quality to maintain a stronger relationship between performance and subsequent appraisals. Didactic proposals for this will be made in section 10.2.

10.2 Practical Implications and Future Directions

10.2.1 The Necessity for Scaling up Formative Feedback in Higher Education

This study contributed to research in statistics education by showing that the implementation of formative feedback in TCs and FCs is an appropriate measure to foster students’ AME appraisals. Hence, the findings go one step beyond other studies, which merely concluded with the assumption that interventions might be adequate to boost statistics attitudes (e.g., Nolan et al., 2012, p. 120). On the grounds of the mostly beneficial impacts of formative assessments in various group contexts, a case can be made for scaling up electronic quizzes in higher education. Instructors should implement more low-stakes, mastery-focused assessments in their courses at regular intervals (Atiq & Loui, 2022, p. 23; Hood et al., 2021). With most learning management systems nowadays providing the necessary technical infrastructure in the form of assessment modules, electronic quizzes can be implemented in any existing curricular structure regardless of the subject domain (Evans et al., 2021, p. 175). Designing standardized tests is a one-time expenditure for setting up a pool of exercises, which can thereafter be used repeatedly and flexibly for any number of participants and which saves correction time due to automated evaluation (Cassady & Gridley, 2005, p. 25; Enders et al., 2021, p. 92; Evans et al., 2021, p. 175).

These assessments were shown to create a safe space for students to overcome demotivating and negative states of mind by building up mastery experience. The present study has shown that even “sparse” feedback with knowledge of the correct result consisting of numeric scores, which do not carry any valence per se (Lipnevich et al., 2021; van de Ridder et al., 2014, p. 804), can impact AME appraisals. This points to great potential for enriching and elaborating the feedback messages to promote even stronger and more consistent reciprocal effects. For instance, the absence of information that increases the value of the feedback may be the reason why effects on interest, value, and enjoyment, in particular, had a lower salience than those on expectancies for success. This is aggravated by the fact that several groups of students were disadvantaged by not profiting from these effects (i.e., female students, low proficient students, and flipped students). Therefore, a didactic modification of the formative assessments should be considered to make quizzes more appealing to intrinsic motivation. The inclusion of relevance-inducing tasks, exercises for reflecting on the usefulness of statistics, or explicit information exemplifying the task relevance, along with guidance for their beneficial processing, could help students to reappraise the value of the feedback more positively (Acee & Weinstein, 2010, p. 490; Evans et al., 2013, p. 90; Gaspard et al., 2015, p. 1227; Zingoni & Byron, 2017, p. 61). This is even more relevant for subjects such as statistics, whose usefulness in later professional life might not be evident to students at the beginning of their studies (Kiekkas et al., 2015, p. 1286). Even though course enjoyment was reciprocally related to formative achievement, the standardized coefficients suggest that formative achievement is less predictive of enjoyment than of other appraisals, such as hopelessness, self-efficacy, affect, or effort.
A possible approach for didactic optimization could be to rebuild the feedback system to make it more emotionally rewarding (Harley et al., 2017, p. 288). Research has shown that gamification succeeds in fostering enjoyment by means of agents (i.e., avatars), narrative elements, visualizations, or bonus systems (Lipnevich et al., 2021). More elaborate feedback could also be used to reattribute failure from a lack of proficiency to a lack of motivation and effort (Pekrun, 2006, p. 336; Schunk, 1989, p. 180). A reattribution to effort as a more readily changeable disposition could prevent students who had received a lower quiz score from reacting overly negatively with a downward adaptation of their AME appraisals, as found in the current data pattern (see section 10.1). Instead, it might help those students to also benefit from the feedback and promote sustained engagement by rendering the negatively appraised feedback more controllable (Arguel et al., 2019, p. 205; González et al., 2016, p. 219). Such a motivational reattribution could be achieved by giving additional information on the wrong answers and a concrete reference to the solution processes in the course materials to elicit goal discrepancies (Enders et al., 2021, p. 93; Narciss, 2008, p. 139). This information conveys that course challenges are less costly and that students can make up for their mistakes through mastery orientation (Rosenzweig et al., 2020). A sound basis for such elaborate, item-specific feedback for standardized assessments could be built by collecting research on common statistics misconceptions or by using evidence-based solution frequencies from past summative or formative assessments of the respective courses.

Another important finding in the linear feedback effects was that students with a higher score in the formative assessment profit more from motivational and emotional enhancement, while students with a lower score profit less. Hence, the question arises of how to appeal to the low achievers who are “trapped” in the vicious circle of downward motivational and emotional adaptations (Acee & Weinstein, 2010, p. 489; Evans, 2013, p. 103; Rosenzweig & Wigfield, 2016, p. 153). As Schunk and Ertmer put it, lower formative performance does not have to undermine one’s sense of self-efficacy if students are convinced that they are capable of learning through self-regulatory adaptations (2000, p. 637). While elaborate feedback as a way to promote adaptive attributions was already mentioned above, another viable strategy could lie in the technical implementation of adaptive feedback. Adaptive feedback provides a more personalized learning experience, giving students more manageable tasks whose difficulty is individually tailored to their performance or misconceptions from prior exercises (Harackiewicz, Smith, et al., 2016, p. 224; Harley et al., 2017, p. 282).
Students who would otherwise be stuck in the demotivating downward spiral could be provided with easier tasks first as motivational anchors to maintain initial task involvement for both low and high proficient students (Butler & Winne, 1995; Schunk, 1989, p. 186). Apart from adapting the order and selection of tasks, adaptive learning environments can also provide additional information, if needed by the student, that points to the correct solution (Harley et al., 2017, p. 282), thus coming close to a natural instructional process. Moreover, when a student’s answer suggests a specific misconception, the student could be given an additional exercise that tackles and problematizes this concept. In all, building on the assumption that electronic feedback per se is suitable for fostering achievement motivation and emotions, most learning systems offer virtually unlimited options to further optimize formative assessments based on the above recommendations.

The data also suggested that the FC is an adequate instructional medium for increasing students’ AME appraisals in conjunction with formative feedback. For instructors wishing to test the course design in which to embed the feedback prior to a whole-class implementation, parts of the course could be considered for a micro-flip as suggested by Fidalgo-Blanco et al. (2017). For instance, topics of the syllabus that are particularly suited for relocation outside the class, such as unchanging theoretical introductions, could be flipped. A more general implication of the above-mentioned benefits of flipped teaching is to provide instructional features and materials that give students autonomy to self-regulate and that increase opportunities for achieving mastery as well as success.
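
The adaptive feedback logic suggested in this section (easier tasks first for low scorers, a targeted follow-up exercise when an answer signals a specific misconception) can be illustrated with a minimal sketch. It assumes a hypothetical item pool in which each exercise carries a difficulty value between 0 and 1 and, optionally, the misconception it targets. The class `Item`, the function `next_item`, the difficulty scale, and the example misconception label are illustrative assumptions and do not describe the quiz system used in the present study.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Item:
    item_id: str
    difficulty: float                    # hypothetical scale: 0 (easy) to 1 (hard)
    misconception: Optional[str] = None  # concept the item targets, if any


def next_item(pool: Sequence[Item], last_score: float,
              flagged_misconception: Optional[str] = None) -> Item:
    """Pick the next exercise from a non-empty pool.

    Rule 1: if the last answer suggested a specific misconception, return
            the easiest item that targets this concept (if one exists).
    Rule 2: otherwise return the item whose difficulty is closest to the
            last quiz score (share of points, 0..1), so that students
            with low scores are given easier tasks first.
    """
    if flagged_misconception is not None:
        targeted = [i for i in pool if i.misconception == flagged_misconception]
        if targeted:
            return min(targeted, key=lambda i: i.difficulty)
    return min(pool, key=lambda i: abs(i.difficulty - last_score))


# Hypothetical item pool for illustration only.
pool = [
    Item("q1", 0.2),
    Item("q2", 0.5, misconception="p-value as probability of H0"),
    Item("q3", 0.8),
]

print(next_item(pool, last_score=0.3).item_id)
print(next_item(pool, last_score=0.9,
                flagged_misconception="p-value as probability of H0").item_id)
```

In this sketch, a student with a low prior score is routed to the easiest item, whereas a flagged misconception overrides the difficulty matching. A real implementation would additionally need calibrated item difficulties, for instance derived from solution frequencies in earlier assessments.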

10.2.2 Methodological Considerations and Limitations of the Present Study

Apart from the above-mentioned practical extensions, this study also has methodological limitations that could be addressed in future follow-up research. First, the provision of a numeric score was equated with the reception of the feedback, while the extent to which this feedback might have prompted specific adaptations in the learning process was ignored. Hence, it remains unclear whether, for instance, the participation in the assessment activities (i.e., recall and retrieval of relevant information) or the provision of the feedback itself was crucial for motivational and emotional enhancement (Morris et al., 2021). In future studies, causal attributions along with individuals’ interpretations of their outcomes should therefore be factored in above and beyond the actual formative outcomes (Eccles & Wigfield, 2002, p. 117). In a similar vein, AME appraisals were related to perceptions of the domain of statistics in general and of the statistics course, but not of the received feedback in particular. If concrete attitudes towards the feedback were assessed, the magnitude of the effect mechanisms could be expected to become even more evident (Zingoni & Byron, 2017, p. 52). This is why future studies should capture students’ specific feedback perceptions to account for proactive recipience, i.e., the state of mind with which they engage in feedback uptake (as in, e.g., Adams et al., 2019; Beile & Boote, 2004). In that regard, qualitative interviews could also be conducted with a subsample of students to explore the feedback and problem-solving processes in greater depth. A reduction of the temporal distance between the assessment of feedback perceptions and the feedback reception itself is also expected to generate more salient effects (Evans, 2013, p. 93), whereas in the present study, sometimes two to three weeks had passed after feedback reception before AME appraisals were assessed again. Even though the appraisals were assessed in the immediate course and learning contexts, emotions during class or while revising lecture notes may be quite different from those before or after feedback reception.

A feasible method for such an assessment is experience sampling, which has already been field-tested by several researchers (e.g., Rausch et al., 2019). The validated scales of the present study could thus be administered while controlling for the different course situations, e.g., during an input phase, during task processing, and immediately after feedback reception. That way, the researcher receives an ecologically valid appraisal of a more immediate motivational and emotional experience, which reduces retrospective bias (Neroni et al., 2019, p. 6; Peterson et al., 2015, p. 85; Seifried & Sembill, 2005, p. 661). Such an approach would also be helpful for assessing learning activities in the FC and TC contexts. While the present study had to resort to assumptions of heightened activity and preparation in the FC, experience sampling could make it possible to assess and control such activities as well, for instance in the form of diary entries. Finally, immediate assessment of students’ appraisals could help to understand how the numeric feedback scores given in the present study are concretely and individually received (Adams et al., 2019; Lipnevich et al., 2021). For instance, a score of 70% could be received in a motivating or a demotivating way depending on individual expectations and ambitions.

Moreover, the present observational study was conducted in an ecologically valid and naturalistic learning environment, which makes it difficult to control all factors under which learning occurs (Enders et al., 2021, p. 93). This could also be a reason for the decreasing magnitude of the feedback effects on AME by the end of the semester.
However, since educational target variables exist in a highly interdependent framework with various internal and external factors, a certain degree of consistency can already suggest meaningful patterns for educational practice (Kiekkas et al., 2015; Loderer et al., 2020; Talsma et al., 2018, p. 144). With regard to the present study, the smaller magnitude of the effects could also be due to the fact that computer-based assessments per se are perceived as less judgmental because of the absence of teacher personality and interpersonal interaction in a classroom (Bangert-Drowns et al., 1991, p. 215; Goetz et al., 2018, p. 558; Mason & Bruning, 2001). On a related note, and in spite of the longitudinal design, which controlled for autoregressive effects and numerous learning-related constructs, it cannot be ruled out that other predictor variables influenced the feedback process, which limits causal conclusions (Neroni et al., 2019, p. 6; Pekrun et al., 2017, p. 1667). An alternative laboratory study, by contrast, is limited to specific contexts only, may overtly demonstrate causality to the participants, and forfeits statistical power due to smaller sample sizes (Khanna et al., 2015, p. 174; Pekrun et al., 2017, p. 1667). The experience sampling method elaborated above could be a first step in balancing the pros and cons of field and laboratory studies by assessing more learning-related variables in their proper and immediate context. Despite these restrictions, causality of the interrelations in the present longitudinal study is at least better supported than in the predominantly cross-sectional findings on the aggregate level.

Finally, the interpretations of the present study are limited by the fact that between- and within-person variance is not partialled out, so that within-person variance is assumed to be stable across time (Burns et al., 2020, p. 79; Gist & Mitchell, 1992, p. 199). Conventional cross-lagged panel models do not control for individual dispositional base levels and fluctuations, thus precluding inferences on personal improvement in motivation, emotion, and achievement over time by means of feedback mechanisms. One possible methodological approach for this would be the random intercept cross-lagged panel model (RI-CLPM; Burns et al., 2020), in which two additional latent factors decompose the between-person and within-person variance.¹ While the autoregressive coefficients in the CLPM are parameters of rank-order stability based on the subjects’ deviation from the group mean, in the RI-CLPM they reflect the likelihood of personal change based on deviations from an individual’s average score at the prior occasion. The within-person perspective would also allow for an identification of differential efficacy-performance relationships before and after practice (Yeo & Neal, 2006, p. 1088).

Despite these limitations, the present study contributed to the empirical research on electronic formative assessment in large lectures through the lens of the CV theory and in view of growing digitalization.
It provides empirical substantiation for the relevance of the learning environment component (i.e., the efficacy of feedback and FC settings) within the CV framework, on which future studies can build under consideration of the proposed practical and methodological extensions.
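
The between- and within-person decomposition behind the RI-CLPM proposed above as a methodological extension can be written out as follows. This is a sketch of the commonly used two-variable formulation and not a model estimated in the present study; the symbols are chosen here for illustration, with $a_{it}$ denoting an appraisal and $q_{it}$ the quiz score of student $i$ at occasion $t$.

```latex
\begin{align}
a_{it} &= \mu_{a,t} + \kappa_{i} + a^{*}_{it}, &
q_{it} &= \mu_{q,t} + \omega_{i} + q^{*}_{it}, \\
a^{*}_{it} &= \alpha_{a}\, a^{*}_{i,t-1} + \beta_{a}\, q^{*}_{i,t-1} + u_{it}, &
q^{*}_{it} &= \alpha_{q}\, q^{*}_{i,t-1} + \beta_{q}\, a^{*}_{i,t-1} + v_{it}.
\end{align}
```

Here $\mu_{a,t}$ and $\mu_{q,t}$ are occasion-specific group means, $\kappa_{i}$ and $\omega_{i}$ are the random intercepts that capture stable between-person differences, and $a^{*}_{it}$ and $q^{*}_{it}$ are the within-person deviations. The autoregressive paths ($\alpha$) and cross-lagged paths ($\beta$) are defined on these deviations only, which is why they can be read as within-person carry-over and spill-over rather than as rank-order stability.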

¹ Another possible modeling approach to account for within-person variance would be latent change modeling.

Bibliography

Aberson, C. L., Berger, D. E., Healy, M. R., Kyle, D. J., & Romero, V. L. (2000). Evaluation of an interactive tutorial for teaching the central limit theorem. Teaching of Psychology, 27(4), 289–291. https://doi.org/10.1207/s15328023top2704_08
Abeysekera, L., & Dawson, P. (2015). Motivation and cognitive load in the flipped classroom: definition, rationale and call for research. Higher Education Research & Development, 34(1), 1–14.
Acee, T. W., & Weinstein, C. E. (2010). Effects of a value-reappraisal intervention on statistics students’ motivation and performance. The Journal of Experimental Education, 78(4), 487–512. https://doi.org/10.1080/00220970903352753
Adams, A.-M., Wilson, H., Money, J., Palmer-Conn, S., & Fearn, J. (2019). Student engagement with feedback and attainment: the role of academic self-efficacy. Assessment & Evaluation in Higher Education, 45(2), 317–329. https://doi.org/10.1080/02602938.2019.1640184
Adamson, K. A., & Prion, S. (2012). Making sense of methods and measurement: validity assessment, part 2. Clinical Simulation in Nursing, 8(8), e383–e384. https://doi.org/10.1016/j.ecns.2012.07.002
AERA, APA, & NCME (2014). Standards for Educational and Psychological Testing: National Council on Measurement in Education. American Educational Research Association.
Ahmed, W., van der Werf, G., Kuyper, H., & Minnaert, A. (2013). Emotions, self-regulated learning, and achievement in mathematics: A growth curve analysis. Journal of Educational Psychology, 105(1), 150–161. https://doi.org/10.1037/a0030160
Ainsworth, S., & Loizou, A. (2003). The effects of self-explaining when learning with text or diagrams. Cognitive Science, 27(4), 669–681. https://doi.org/10.1016/S0364-0213(03)00033-8
Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50, 179–211. https://doi.org/10.1016/0749-5978(91)90020-t
AlJarrah, A., Thomas, M., & Shebab, M. (2018). Investigating temporal access in a flipped classroom: procrastination persists. International Journal of Educational Technology in Higher Education, 15, 9–17. https://doi.org/10.1145/2090116.2090118

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2023 A. Maur, Electronic Feedback in Large University Statistics Courses, https://doi.org/10.1007/978-3-658-41620-1

Allen, K., Rhoads, T. R., Murphy, T., & Stone, A. (2004). The statistics concepts inventory: Developing a valid and reliable instrument [Paper presentation]. 2004 annual conference, Salt Lake City, Utah. https://doi.org/10.18260/1-2--13652
Anderson, L. W., & Krathwohl, D. R. (2001). A Taxonomy for Learning, Teaching and Assessing. Longman.
Angus, S. D., & Watson, J. (2009). Does regular online testing enhance student learning in the numerical sciences? Robust evidence from a large data set. British Journal of Educational Technology, 40(2), 255–272.
Arens, A. K., Frenzel, A. C., & Goetz, T. (2022). Self-concept and self-efficacy in math: longitudinal interrelations and reciprocal linkages with achievement. The Journal of Experimental Education, 90(3), 615–633. https://doi.org/10.1080/00220973.202.1786347
Arguel, A., Lockyer, L., Kennedy, G., Lodge, J. M., & Pachman, M. (2019). Seeking optimal confusion: a review on epistemic emotion management in interactive digital learning environments. Interactive Learning Environments, 27(2), 200–21. https://doi.org/10.1080/1049482.2018.1457544
Asarta, C., & Schmidt, J. (2013). Access patterns of online materials in a blended course. Decision Sciences Journal of Innovative Education, 11(1), 107–123. https://doi.org/10.1111/j.1540-4609.2012.00366.x
Ashford, S., Edmunds, J., & French, D. P. (2010). What is the best way to change self-efficacy to promote lifestyle and recreational physical activity? A systematic review with meta-analysis. British Journal of Health Psychology, 15(Pt 2), 265–288. https://doi.org/10.1348/135910709X461752
Asparouhov, T., & Muthén, B. (2008, May 5). Auxiliary Variables Predicting Missing Data. https://www.statmodel.com/download/AuxM2.pdf
Asparouhov, T., & Muthén, B. (2012, November 14). Comparison of computational methods for high dimensional item factor analysis. https://www.statmodel.com/download/HighDimension11.pdf
Atiq, Z., & Loui, M. C. (2022). A Qualitative Study of Emotions Experienced by First-year Engineering Students during Programming Tasks. ACM Transactions on Computing Education, 22(3), 1–26. https://doi.org/10.1145/3507696
Autin, F., & Croizet, J.-C. (2012). Improving working memory efficiency by reframing metacognitive interpretation of task difficulty. Journal of Experimental Psychology: General, 141(4), 610–618. https://doi.org/10.1037/a0027478
Avar, Z., & Sadi, Ö. (2020). The relationship between students’ perceptions of learning environment and achievement emotions: A multivariate analysis. FIRE: Forum for International Research in Education, 6(2), 125–14.
Awang, Z. (2014). A handbook of SEM. MPWS Publisher.
Azorlosa, J. L. (2011). The effect of announced quizzes on exam performance: II. Journal of Instructional Psychology, 38(1), 3–7.
Azzi, A. J., Ramnanan, C. J., Smith, J., Dionne, É., & Jalali, A. (2015). To quiz or not to quiz: Formative tests help detect students at risk of failing the clinical anatomy course. Anatomical Sciences Education, 8(5), 413–42. https://doi.org/10.1002/ase.1488
Babakus, E., Ferguson, C. E., & Jöreskog, K. G. (1987). The sensitivity of confirmatory maximum likelihood factor analysis to violations of measurement scale and distributional assumptions. Journal of Marketing Research, 24(2), 222–228.

Bacon, D., & Stewart, K. (2006). How fast do students forget what they learn in consumer behavior? A longitudinal study. Journal of Marketing Education, 28(3), 181–192. https://doi.org/10.1177/0273475306291463
Bagozzi, R. P., & Baumgartner, H. (1994). The evaluation of structural equation models and hypotheses testing. In R. P. Bagozzi (Ed.), Principles of marketing research (pp. 386–422). Blackwell.
Bagozzi, R. P., & Yi, Y. (1988). On the evaluation of structural equation models. Journal of the Academy of Marketing Science, 16, 74–94.
Bagozzi, R. P., & Yi, Y. (1991). Multitrait-multimethod matrices in consumer research. Journal of Consumer Research, 17, 426–439.
Bagozzi, R. P., & Yi, Y. (2012). Specification, evaluation, and interpretation of structural equation models. Journal of the Academy of Marketing Science, 40, 8–34. https://link.springer.com/article/10.1007/s11747-011-0278-x
Bahrick, H. P., & Hall, L. K. (2005). The importance of retrieval failures to long-term retention: A metacognitive explanation of the spacing effect. Journal of Memory and Language, 52, 566–577. https://doi.org/10.1016/j.jml.2005.01.012
Baker, R. M., & Dwyer, F. (2000). A meta-analytic assessment of the effect of visualized instruction. International Journal of Instructional Media, 27(4), 417–426.
Baker, J. P., & Goodboy, A. K. (2019). The choice is yours: The effects of autonomy-supportive instruction on students’ learning and communication. Communication Education, 68(1), 80–102. https://doi.org/10.1080/03634523.2018.1536793
Balaban, R. A., Gilleskie, D. B., & Tran, U. (2016). A quantitative evaluation of the flipped classroom in a large lecture principles of economics course. The Journal of Economic Education, 47(4), 269–287. https://doi.org/10.1080/00220485.2016.1213679
Baloğlu, M. (2003). Individual differences in statistics anxiety among college students. Personality and Individual Differences, 34(5), 855–865. https://doi.org/10.1016/s0191-8869(02)00076-4
Bälter, O., Enström, E., & Klingenberg, B. (2013). The effect of short formative diagnostic web quizzes with minimal feedback. Computers & Education, 60(1), 234–242. https://doi.org/10.1016/j.compedu.2012.08.014
Bandura, A. (2015). Self-regulation of motivation and action through internal standards and goal system. In L. A. Pervin (Ed.), Goal concepts in personality and social psychology (pp. 19–86). Psychology Press.
Bandura, A. (1986). The explanatory and predictive scope of self-efficacy theory. Journal of Social and Clinical Psychology, 4(3), 359–373. https://doi.org/10.1521/jscp.1986.4.3.359
Bandura, A. (1997). Self-efficacy: The exercise of control. W.H. Freeman.
Bandura, A., & Locke, E. A. (2003). Negative self-efficacy and goal effects revisited. Journal of Applied Psychology, 88(1), 87–99. https://doi.org/10.1037/0021-9010.88.1.87
Banfield, J., & Wilkerson, B. (2014). Increasing student intrinsic motivation and self-efficacy through gamification pedagogy. Contemporary Issues in Education Research (CIER), 7(4), 291–298. https://doi.org/10.19030/cier.v7i4.8843
Bangert-Drowns, R. L., Kulik, C.-L. C., Kulik, J. A., & Morgan, M. T. (1991). The instructional effect of feedback in test-like events. Review of Educational Research, 61(2), 213–238.
Baroody, A. J., & Ginsburg, H. P. (2013). The relationship between initial meaningful and mechanical knowledge of arithmetic. In J. Hiebert (Ed.), Conceptual and Procedural

Knowledge: The Case of Mathematics (pp. 75–112). Taylor and Francis. https://doi.org/10.4324/9780203063538-9
Barron, K. E., & Hulleman, C. S. (2015). Expectancy-Value-Cost Model of Motivation. In J. D. Wright (Ed.), International encyclopedia of the social & behavioral sciences (pp. 503–509). Elsevier. https://doi.org/10.1016/B978-0-08-097086-8.26099-6
Bastian, C. C. von, & Eschen, A. (2016). Does working memory training have to be adaptive? Psychological Research, 80(2), 181–194. https://doi.org/10.1007/s00426-015-0655-z
Bateiha, S., Marchionda, H., & Autin, M. (2020). Teaching Style and Attitudes: A Comparison of Two Collegiate Introductory Statistics Classes. Journal of Statistics Education, 28(2), 154–164. https://doi.org/10.1080/10691898.2020.1765710
Bates, S., & Galloway, R. (2012). The inverted classroom in a large enrolment introductory physics course: A case study. Higher Education Academy. https://www2.ph.ed.ac.uk/~rgallowa/Bates_Galloway.pdf
Baumert, J., & Kunter, M. (2006). Stichwort: Professionelle Kompetenz von Lehrkräften. Zeitschrift für Erziehungswissenschaft, 9(4), 469–520. https://doi.org/10.1007/s11618-006-0165-2
Beatson, N. J., Berg, D. A. G., & Smith, J. K. (2018). The impact of mastery feedback on undergraduate students’ self-efficacy beliefs. Studies in Educational Evaluation, 59, 58–66. https://doi.org/10.1016/j.stueduc.2018.03.002
Bechrakis, T., Gialamas, V., & Barkatsas, A. N. (2011). Survey of Attitudes Toward Statistics (SATS): An investigation of its construct validity and its factor structure invariance by gender. International Journal of Theoretical Educational Practice, 1(1), 1–15.
Beile, P. M., & Boote, D. N. (2004). Does the medium matter? A comparison of a Web-based tutorial with face-to-face library instruction on education students’ self-efficacy levels and learning outcomes. Research Strategies, 20(1–2), 57–68. https://doi.org/10.1016/j.resstr.2005.07.002
Ben-Zvi, D. (2018). Three paradigms to develop students’ statistical reasoning. In M. A. Sorto, A. White, & L. Guyot (Eds.), Looking back, looking forward. Proceedings of the Tenth International Conference on Teaching Statistics. International Statistics Institute.
Berg, B. (2021). SDAPS—Imprint. https://sdaps.org/imprint
Berweger, B., Born, S., & Dietrich, J. (2022). Expectancy-value appraisals and achievement emotions in an online learning environment: Within- and between-person relationships. Learning and Instruction, 77. https://doi.org/10.1016/j.learninstruc.2021.101546
Bhansali, A., & Sharma, M. D. (2019). The Achievement Emotions Questionnaire: Validation and implementation for undergraduate physics practicals. International Journal of Innovation in Science and Mathematics Education, 27(9), 34–46.
Biggs, J. B., & Collis, K. F. (1982). Origin and description of the SOLO taxonomy. In Evaluating the quality of learning (pp. 17–31). Elsevier. https://doi.org/10.1016/b978-0-12-097552-5.50007-7
Bishop, J., & Verleger, M. (n.d.). The flipped classroom: A survey of the research. In 2013 ASEE annual conference & exposition. ASEE Conferences. https://doi.org/10.18260/1-2--22585
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7–74. https://doi.org/10.1080/0969595980050102

Boekaerts, M., & Cascallar, E. (2006). How far have we moved toward the integration of theory and practice in self-regulation? Educational Psychology Review, 18(3), 199–210. https://doi.org/10.1007/s10648-006-9013-4
Boekaerts, M., Zeidner, M., & Pintrich, P. R. (Eds.). (1999). Handbook of self-regulation. Academic Press.
Bollen, K. A. (1989). Structural equations with latent variables. Wiley.
Bong, M., & Skaalvik, E. M. (2003). Academic self-concept and self-efficacy: How different are they really? Educational Psychology Review, 15(1), 1–40. https://doi.org/10.1023/A:1021302408382
Bouwmeester, R. A., Kleijn, R. A. de, van den Berg, I. E., Cate, O. T. ten, van Rijen, H. V., & Westerveld, H. E. (2019). Flipping the medical classroom: Effect on workload, interactivity, motivation and retention of knowledge. Computers & Education, 139, 118–128. https://doi.org/10.1016/j.compedu.2019.05.002
Bradley, D. R., & Wygant, C. R. (1998). Male and female differences in anxiety about statistics are not reflected in performance. Psychological Reports, 80, 245–246.
Brehm, J. W., & Self, E. A. (1989). The intensity of motivation. Annual Review of Psychology, 40(1), 109–131. https://doi.org/10.1146/annurev.ps.40.020189.000545
Broers, N. (2002). Selection and use of propositional knowledge in statistical problem solving. Learning and Instruction, 12(3), 323–344. https://doi.org/10.1016/S0959-4752(01)00025-1
Brown, T. A. (2006). Confirmatory factor analysis for applied research: Methodology in the social sciences (2nd Ed.). The Guilford Press.
Brown, D. (2016). The type and linguistic foci of oral corrective feedback in the L2 classroom: A meta-analysis. Language Teaching Research, 20(4), 436–458. https://doi.org/10.1177/1362168814563200
Brown, G. T. L., & Harris, L. R. (Eds.). (2016). Educational psychology handbook series. Handbook of human and social conditions in assessment. Routledge. https://www.taylorfrancis.com/books/9781315749136 https://doi.org/10.4324/9781315749136
Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21(2), 230–258. https://doi.org/10.1177/0049124192021002005
Budé, L., Van De Wiel, M. W. J., Imbos, T., Candel, M. J. J. M., Broers, N. J., & Berger, M. P. F. (2007). Students’ achievements in a statistics course in relation to motivational aspects and study behaviour. Statistics Education Research Journal, 6(1), 5–21. https://doi.org/10.52041/serj.v6i1.491
Buff, A. (2014). Enjoyment of learning and its personal antecedents: Testing the change–change assumption of the control-value theory of achievement emotions. Learning and Individual Differences, 31, 21–29. https://doi.org/10.1016/j.lindif.2013.12.007
Buhi, E. R., Goodson, P., & Neilands, T. B. (2008). Out of sight, not out of mind: Strategies for handling missing data. American Journal of Health Behavior, 32(1), 83–92. https://doi.org/10.5993/ajhb.32.1.8
Bühner, M. (2021). Einführung in die Test- und Fragebogenkonstruktion (4th Ed.). Pearson Studium—Psychologie. Pearson Studium.
Burgoyne, S., & Eaton, J. (2018). The partially flipped classroom. Teaching of Psychology, 45(2), 154–157. https://doi.org/10.1177/0098628318762894

Burić, I. (2015). The role of social factors in shaping students’ test emotions: A mediation analysis of cognitive appraisals. Social Psychology of Education, 18(4), 785–809. https://doi.org/10.1007/s11218-015-9307-9
Burnham, E., & Blankenship, E. (2020). Lessons learned: Revising an online introductory course. CHANCE, 33(4), 50–55. https://doi.org/10.1080/09332480.2020.1847961
Burns, R. A., Crisp, D. A., & Burns, R. B. (2020). Re-examining the reciprocal effects model of self-concept, self-efficacy, and academic achievement in a comparison of the Cross-Lagged Panel and Random-Intercept Cross-Lagged Panel frameworks. The British Journal of Educational Psychology, 90(1), 77–91. https://doi.org/10.1111/bjep.12265
Burton, K. D., Lydon, J. E., D’Alessandro, D. U., & Koestner, R. (2006). The differential effects of intrinsic and identified motivation on well-being and performance: Prospective, experimental, and implicit approaches to self-determination theory. Journal of Personality and Social Psychology, 91(4), 750–762. https://doi.org/10.1037/0022-3514.91.4.750
Butcher, K. (2006). Learning from text with diagrams: Promoting mental model development and inference generation. Journal of Educational Psychology, 98(1), 182–197. https://doi.org/10.1037/0022-0663.98.1.182
Butler, D. L., & Winne, P. H. (1995). Feedback and self-regulated learning: A theoretical synthesis. Review of Educational Research, 65(3), 245. https://doi.org/10.2307/1170684
Butler, R. (1988). Enhancing and undermining intrinsic motivation: The effects of task-involving and ego-involving evaluation on interest and performance. British Journal of Educational Psychology, 58(1), 1–14. https://doi.org/10.1111/j.2044-8279.1988.tb00874.x
Byrne, B. M., Shavelson, R. J., & Muthén, B. (1989). Testing for the equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105(3), 456–466. https://doi.org/10.1037/0033-2909.105.3.456
Cai, Q., Chen, B., Wu, H., & Trussell, G. (2018). Using differentiated feedback to improve performance in introductory statistics. Innovations in Education and Teaching International, 56(4), 434–445. https://doi.org/10.1080/14703297.2018.1508362
Camacho-Morles, J., Slemp, G. R., Pekrun, R., Loderer, K., Hou, H., & Oades, L. G. (2021). Activity achievement emotions and academic performance: A meta-analysis. Educational Psychology Review, 33(3), 1051–1095. https://doi.org/10.1007/s10648-020-09585-3
Cameron, J., & Pierce, W. D. (1994). Reinforcement, reward, and intrinsic motivation: A meta-analysis. Review of Educational Research, 64(3), 363–423. https://doi.org/10.3102/00346543064003363
Campbell, D. J. (1988). Task complexity: A review and analysis. Academy of Management Review, 13(1), 40–52. https://doi.org/10.5465/amr.1988.4306775
Carlson, K. A., & Winquist, J. R. (2011). Evaluating an active learning approach to teaching introductory statistics: A classroom workbook approach. Journal of Statistics Education, 19(1). https://doi.org/10.1080/10691898.2011.11889596
Carmona, J. (2005). Mathematical background and attitudes toward statistics in a sample of Spanish college students. Psychological Reports, 97(5), 53. https://doi.org/10.2466/pr.97.5.53-62
Carnell, L. J. (2008). The effect of a student-designed data collection project on attitudes toward statistics. Journal of Statistics Education, 16(1). https://doi.org/10.1080/10691898.2008.11889551

Carter, C., Carter, R., & Foss, A. (2018). The flipped classroom in a terminal college mathematics course for liberal arts students. AERA Open, 4(1), 1–14. https://doi.org/10.1177/2332858418759266
Carver, C. S. (2012). Self-awareness. In M. R. Leary & J. P. Tangney (Eds.), Handbook of self and identity (pp. 50–68). Guilford Press.
Carver, C. S., & Scheier, M. F. (2001). On the self-regulation of behavior. Cambridge University Press.
Carver, C. S., & Scheier, M. F. (2000). On the structure of behavioral self-regulation. In Handbook of Self-Regulation (pp. 41–84). Elsevier. https://doi.org/10.1016/b978-012109890-2/50032-9
Cashin, S. E., & Elmore, P. B. (2005). The Survey of Attitudes Toward Statistics scale: A construct validity study. Educational and Psychological Measurement, 65(3), 509–524.
Cassady, J. C., & Gridley, B. E. (2005). The effects of online formative and summative assessment on test anxiety and performance. Journal of Technology, Learning, and Assessment, 4(1). https://ejournals.bc.edu/index.php/jtla/article/view/1648
Cassady, J., Budenz-Anders, J., Pavlechko, G., & Mock, W. (2001, April). The effects of internet-based formative and summative assessment on test anxiety, perceptions of threat, and achievement [Paper presentation]. Annual Meeting of the American Educational Research Association, Seattle, United States.
Cerasoli, C., & Nicklin, J. (2014). Intrinsic motivation and extrinsic incentives jointly predict performance: A 40-year meta-analysis. Psychological Bulletin, 140(4), 980–1008. https://doi.org/10.1037/a0035661
Chan, K., Wan, K., & King, V. (2021). Performance over enjoyment? Effect of game-based learning on learning outcome and flow experience. Frontiers in Education, 6, Article 660376. https://doi.org/10.3389/feduc.2021.660376
Chan, S. W., & Ismail, Z. (2014). Developing statistical reasoning assessment instrument for high school students in descriptive statistics. Procedia—Social and Behavioral Sciences, 116, 4338–4343. https://doi.org/10.1016/j.sbspro.2014.01.943
Chan, S., Ismail, Z., & Sumintono, B. (2016). Assessing statistical reasoning in descriptive statistics: A qualitative meta-analysis. Jurnal Teknologi, 78(6–5). https://doi.org/10.11113/jt.v78.8995
Chance, B., delMas, R., & Garfield, J. (2005). Reasoning about sampling distributions. In D. Ben-Zvi & J. Garfield (Eds.), The challenge of developing statistical literacy, reasoning, and thinking (pp. 295–323). Kluwer.
Chang, W., Franke, G. R., & Lee, N. (2016). Comparing reflective and formative measures: New insights from relevant simulations. Journal of Business Research, 69(8), 3177–3185. https://doi.org/10.1016/j.jbusres.2015.12.006
Chans, G. M., & Portuguez Castro, M. (2021). Gamification as a strategy to increase motivation and engagement in higher education chemistry students. Computers, 10(10), 132. https://doi.org/10.3390/computers10100132
Chao, C.-Y., Chen, Y.-T., & Chuang, K.-Y. (2015). Exploring students’ learning attitude and achievement in flipped learning supported computer aided design curriculum: A study in high school engineering education. Computer Applications in Engineering Education, 23(4), 514–526. https://doi.org/10.1002/cae.21622

Chen, F. F. (2007). Sensitivity of goodness of fit indexes to lack of measurement invariance. Structural Equation Modeling: A Multidisciplinary Journal, 14(3), 464–504. https://doi.org/10.1080/10705510701301834
Chen, P.-Y., Wu, W., Garnier-Villarreal, M., Kite, B. A., & Jia, F. (2019). Testing measurement invariance with ordinal missing data: A comparison of estimators and missing data techniques. Multivariate Behavioral Research, 55(1), 87–101. https://doi.org/10.1080/00273171.2019.1608799
Chen, Y., Wang, Y., Kinshuk, & Chen, N.-S. (2014). Is FLIP enough? Or should we use the FLIPPED model instead? Computers & Education, 79, 16–27. https://doi.org/10.1016/j.compedu.2014.07.004
Cheung, G. W., & Rensvold, R. B. (1999). Testing factorial invariance across groups: A reconceptualization and proposed new method. Journal of Management, 25(1), 1–27. https://doi.org/10.1177/014920639902500101
Chew, P. K. H., & Dillon, D. B. (2014). Statistics anxiety update: Refining the construct and recommendations for a new research agenda. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 9(2), 196–208. https://doi.org/10.1177/1745691613518077
Chiesi, F., & Primi, C. (2010). Cognitive and non-cognitive factors related to students’ statistics achievement. Statistics Education Research Journal, 9(1), 6–26.
Chiesi, F., & Primi, C. (2009). Assessing statistics attitudes among college students: Psychometric properties of the Italian version of the Survey of Attitudes toward Statistics (SATS). Learning and Individual Differences, 19(2), 309–313. https://doi.org/10.1016/j.lindif.2008.10.008
Chin, W. (1998a). Issues and Opinion on Structural Equation Modeling. MIS Quarterly, 22, 7–16.
Chin, W. (1998b). The partial least squares approach to structural equation modeling. In G. A. Marcoulides (Ed.), Quantitative methodology series. Modern methods for business research (pp. 295–336). Lawrence Erlbaum.
Cho, M.-H., & Heron, M. L. (2015). Self-regulated learning: The role of motivation, emotion, and use of learning strategies in students’ learning experiences in a self-paced online mathematics course. Distance Education, 36(1), 80–99. https://doi.org/10.1080/01587919.2015.1019963
Christenson, S. L., Reschly, A. L., & Wylie, C. (Eds.). (2012). Handbook of research on student engagement. Springer US. https://doi.org/10.1007/978-1-4614-2018-7
Chuang, H.-H., Weng, C.-Y., & Chen, C.-H. (2018). Which students benefit most from a flipped classroom approach to language learning? British Journal of Educational Technology, 49(1), 56–68. https://doi.org/10.1111/bjet.12530
Chungkham, H. S., Ingre, M., Karasek, R., Westerlund, H., & Theorell, T. (2013). Factor structure and longitudinal measurement invariance of the demand control support model: An evidence from the Swedish Longitudinal Occupational Survey of Health (SLOSH). PloS One, 8(8), e70541. https://doi.org/10.1371/journal.pone.0070541
Cilli-Turner, E. (2015). Measuring learning outcomes and attitudes in a flipped introductory statistics course. PRIMUS, 25(9–10), 833–846. https://doi.org/10.1080/10511970.2015.1046004

Clark, J., Kraut, G., Mathews, D., & Wimbish, J. (2007, July 25). The “fundamental theorem” of statistics: Classifying student understanding of basic statistical concepts. https://citeseerx.ist.psu.edu/viewdoc/download?doi=1.1.1.636.8858&rep=rep1&type=pdf
Clark, D. A., & Svinicki, M. (2015). The effect of retrieval on post-task enjoyment of studying. Educational Psychology Review, 27(1), 51–67. https://doi.org/10.1007/s10648-014-9272-4
Clark, I. (2012). Formative assessment: Assessment is for self-regulated learning. Educational Psychology Review, 24(2), 205–249. https://doi.org/10.1007/s10648-011-9191-6
Clark, N. M., & Zimmerman, B. J. (1990). A social cognitive view of self-regulated learning about health. Health Education Research, 5(3), 371–379. https://doi.org/10.1093/her/5.3.371
Clark, N. M., & Zimmerman, B. J. (2014). A social cognitive view of self-regulated learning about health. Health Education & Behavior: The Official Publication of the Society for Public Health Education, 41(5), 485–491. https://doi.org/10.1177/1090198114547512
Cleary, T. J. (2008). Monitoring trends and accuracy of self-efficacy beliefs during interventions: Advantages and potential applications to school-based settings. Psychology in the Schools, 46(2), 154–171. https://doi.org/10.1002/pits.20360
Clem, A.-L., Hirvonen, R., Aunola, K., & Kiuru, N. (2021). Reciprocal relations between adolescents’ self-concepts of ability and achievement emotions in mathematics and literacy. Contemporary Educational Psychology, 65, 1–11. https://doi.org/10.1016/j.cedpsych.2021.101964
Coetzee, S., & Van der Merwe, P. (2010). Industrial psychology students’ attitudes towards statistics. SA Journal of Industrial Psychology, 36(1). https://doi.org/10.4102/sajip.v36i1.843
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Erlbaum.
Cohen, J. (1977). Statistical Power Analysis for the Behavioral Sciences. Academic Press.
Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research, 51(3), 281–309. https://doi.org/10.3102/00346543051003281
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidious effects of failing to include design-driven correlated residuals in latent-variable covariance structure analysis. Psychological Methods, 12(4), 381–398. https://doi.org/10.1037/1082-989X.12.4.381
Cole, J. S., Bergin, D. A., & Whittaker, T. A. (2008). Predicting student achievement for low stakes tests with effort and task value. Contemporary Educational Psychology, 33(4), 609–624. https://doi.org/10.1016/j.cedpsych.2007.10.002
Combs, J., & Onwuegbuzie, A. (2012). Relationships among attitudes, coping strategies, and achievement in doctoral-level statistics courses: A mixed research study. International Journal of Doctoral Studies, 7, 349–375. https://doi.org/10.28945/1742
Compeau, D. R., & Higgins, C. A. (1995). Computer self-efficacy: Development of a measure and initial test. MIS Quarterly, 19(2), 189. https://doi.org/10.2307/249688
Conrad, M. (2020). Emotionales Erleben und Wissenserwerb im computergestützten Wirtschaftsunterricht. Springer Fachmedien. https://link.springer.com/book/10.1007/978-3-658-29013-9
Cook, B. R., & Babon, A. (2017). Active learning through online quizzes: Better learning and less (busy) work. Journal of Geography in Higher Education, 41(1), 24–38. https://doi.org/10.1080/03098265.2016.1185772

Corno, L., & Anderman, E. M. (Eds.). (2016). Handbook of educational psychology (Third edition). Routledge.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98–104.
Covington, M. V., & Omelich, C. L. (1987). “I knew it cold before the exam”: A test of the anxiety-blockage hypothesis. Journal of Educational Psychology, 79(4), 393–400. https://doi.org/10.1037/0022-0663.79.4.393
Credé, M., Roch, S. G., & Kieszczynka, U. M. (2010). Class attendance in college: A meta-analytic review of the relationship of class attendance with grades and student characteristics. Review of Educational Research, 80(2), 272–295. https://doi.org/10.3102/0034654310362998
Crocco, F., Offenholley, K., & Hernandez, C. (2016). A proof-of-concept study of game-based learning in higher education. Simulation & Gaming, 47(4), 403–422. https://doi.org/10.1177/1046878116632484
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.
Csikszentmihalyi, M. (2014). Flow and the Foundations of Positive Psychology. Springer.
Curelaru, V., & Diac, G. (2022). Perceived classroom assessment environment and autonomous motivation as predictors of students’ achievement emotions in relation to learning for baccalaureate exam. Educatia 21(22), 50–64. https://doi.org/10.24193/ed21.2022.22.06
Curran, P., West, S., & Finch, J. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16–29.
Curran, P. J., & Bollen, K. A. (2006). Latent curve models: A structural equation perspective. John Wiley & Sons.
d’Alessio, M. A. (2018). The effect of microteaching on science teaching self-efficacy beliefs in preservice elementary teachers. Journal of Science Teacher Education, 29(6), 441–467. https://doi.org/10.1080/1046560X.2018.1456883
Dabbagh, N., & Bannan-Ritland, B. (2005). Online learning: Concepts, strategies, and applications. Pearson Education.
Daniels, L. M., & Stupnisky, R. H. (2012). Not that different in theory: Discussing the control-value theory of emotions in online learning environments. The Internet and Higher Education, 15(3), 222–226. https://doi.org/10.1016/j.iheduc.2012.04.002
Daniels, L. M., Haynes, T. L., Stupnisky, R. H., Perry, R. P., Newall, N. E., & Pekrun, R. (2008). Individual differences in achievement goals: A longitudinal study of cognitive, emotional, and achievement outcomes. Contemporary Educational Psychology, 33(4), 584–608. https://doi.org/10.1016/j.cedpsych.2007.08.002
Dauphinee, T. L., Schau, C., & Stevens, J. J. (1997). Survey of attitudes toward statistics: Factor structure and factorial invariance for women and men. Structural Equation Modeling: A Multidisciplinary Journal, 4(2), 129–141. https://doi.org/10.1080/10705519709540066
Davari, H., Karami, H., Nourzadeh, S., & Iranmehr, A. (2020). Examining the validity of the Achievement Emotions Questionnaire for measuring more emotions in the foreign language classroom. Journal of Multilingual and Multicultural Development, 1–14. https://doi.org/10.1080/01434632.2020.1766054

Davidshofer, C. O., & Murphy, K. R. (2013). Psychological testing: Pearson new international edition: Principles and applications. Pearson Education, Limited.
Davies, P. G., & Spencer, S. J. (2005). The gender-gap artifact: Women’s underperformance in quantitative domains through the lens of stereotype threat. In A. M. Gallagher & J. C. Kaufman (Eds.), Gender differences in mathematics: An integrative psychological approach (pp. 172–188). Cambridge University Press.
Davis, F. D. (1989). Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Quarterly, 13(3), 319. https://doi.org/10.2307/249008
Day, I. N. Z., van Blankenstein, F. M., Westenberg, M., & Admiraal, W. (2018). A review of the characteristics of intermediate assessment and their relationship with student grades. Assessment & Evaluation in Higher Education, 43(6), 908–929. https://doi.org/10.1080/02602938.2017.1417974
De la Fuente, J., Lahortiga-Ramos, F., Laspra-Solís, C., Maestro-Martín, C., Alustiza, I., Aubá, E., & Martín-Lanas, R. (2020). A structural equation model of achievement emotions, coping strategies and engagement-burnout in undergraduate students: A possible underlying mechanism in facets of perfectionism. International Journal of Environmental Research and Public Health, 17, 1–26.
de Vries, H., Dijkstra, M., & Kuhlman, P. (1988). Self-efficacy: The third factor besides attitude and subjective norm as a predictor of behavioural intentions. Health Education Research, 3(3), 273–282. https://doi.org/10.1093/her/3.3.273
Deci, E. L., & Ryan, R. M. (1994). Promoting self-determined education. Scandinavian Journal of Educational Research, 38(1), 3–14. https://doi.org/10.1080/0031383940380101
Deci, E. L., & Ryan, R. M. (2016). Optimizing students’ motivation in the era of testing and pressure: A self-determination theory perspective. In W. C. Liu, J. C. K. Wang, & R. M. Ryan (Eds.), Building autonomous learners (pp. 9–29). Springer Singapore. https://doi.org/10.1007/978-981-287-630-0_2
Deci, E. L., Koestner, R., & Ryan, R. M. (1999). A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin, 125(6), 627–668. https://doi.org/10.1037/0033-2909.125.6.627
Deci, E. L., Koestner, R., & Ryan, R. M. (2001). Extrinsic rewards and intrinsic motivation in education: Reconsidered once again. Review of Educational Research, 71(1), 1–27. https://doi.org/10.3102/00346543071001001
Deci, E. L., Ryan, R. M., & Williams, G. C. (1996). Need satisfaction and the self-regulation of learning. Learning and Individual Differences, 8(3), 165–183. https://doi.org/10.1016/s1041-6080(96)90013-8
Dehghan, S., Horan, E. M., & Frome, G. (2022). Investigating the impact of the flipped classroom on student learning and enjoyment in an organic chemistry course. Journal of Chemical Education, 99(7), 2512–2519. https://doi.org/10.1021/acs.jchemed.1c01104
delMas, R. (2005). A comparison of mathematical and statistical reasoning. In D. Ben-Zvi & J. Garfield (Eds.), The challenge of developing statistical literacy, reasoning, and thinking (pp. 79–96). Kluwer.
delMas, R., Garfield, J., Ooms, A., & Chance, B. (2007). Assessing students’ conceptual understanding after a first course in statistics. Statistics Education Research Journal, 6(2), 28–58. https://doi.org/10.52041/serj.v6i2.483

Demirer, V., & Sahin, I. (2013). Effect of blended learning environment on transfer of learning: An experimental study. Journal of Computer Assisted Learning, 29, 518–529.
Derry, S. J., Levin, J. R., Osana, H. P., Jones, M. S., & Peterson, M. (2000). Fostering students’ statistical and scientific thinking: Lessons learned from an innovative college course. American Educational Research Journal, 37(3), 747–773. https://doi.org/10.3102/00028312037003747
Dettmers, S., Trautwein, U., Lüdtke, O., Goetz, T., Frenzel, A. C., & Pekrun, R. (2011). Students’ emotions during homework in mathematics: Testing a theoretical model of antecedents and achievement outcomes. Contemporary Educational Psychology, 36(1), 25–35. https://doi.org/10.1016/j.cedpsych.2010.10.001
DeVaney, T. A. (2010). Anxiety and attitude of graduate students in on-campus vs. online statistics courses. Journal of Statistics Education, 18(1). https://doi.org/10.1080/10691898.2010.11889472
Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators: An alternative to scale development. Journal of Marketing Research, 38(2), 269–277. https://www.jstor.org/stable/1558630
Diamantopoulos, A., & Papadopoulos, N. (2009). Assessing the cross-national invariance of formative measures: Guidelines for international business researchers. Journal of International Business Studies, 41(2), 360–370. https://doi.org/10.1057/jibs.2009.37
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education, 37, 830–837.
Downing, S. M., & Haladyna, T. M. (2004). Validity threats: Overcoming interference with proposed interpretations of assessment data. Medical Education, 38(3), 327–333. https://doi.org/10.1046/j.1365-2923.2004.01777.x
Doyle, R. A., & Voyer, D. (2016). Stereotype manipulation effects on math and spatial test performance: A meta-analysis. Learning and Individual Differences, 47, 103–116.
Duncan, C., Kim, M., Baek, S., Wu, K. Y. Y., & Sankey, D. (2021). The limits of motivation theory in education and the dynamics of value-embedded learning (VEL). Educational Philosophy and Theory, 54(5), 618–629. https://doi.org/10.1080/00131857.2021.1897575
Eberl, M. (2004). Formative und reflektive Indikatoren im Forschungsprozess: Entscheidungsregeln und die Dominanz des reflektiven Modells. EFOplan, 19, 1–44. https://www.imm.bwl.uni-muenchen.de/forschung/schriftenefo/ap_efoplan_19.pdf
Eberl, M., & Mitschke-Collande, D. von (2006, June). Die Verträglichkeit kovarianz- und varianzbasierter Schätzverfahren für Strukturgleichungsmodelle—Eine Simulationsstudie. LMU München, Institut für Marktorientierte Unternehmensführung. https://www.imm.bwl.uni-muenchen.de/forschung/schriftenefo/3825.pdf
Eccles, J. S. (1983). Expectancies, values, and academic behaviors. In J. T. Spence (Ed.), (A Series of books in psychology). Achievement and achievement motives: Psychological and sociological approaches (pp. 75–146). W.H. Freeman.
Eccles, J. S., & Wigfield, A. (1995). In the mind of the actor: The structure of adolescents’ achievement task values and expectancy-related beliefs. Personality and Social Psychology Bulletin, 21(3), 215–225. https://doi.org/10.1177/0146167295213003
Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53(1), 109–132. https://doi.org/10.1146/annurev.psych.53.100901.135153

Eccles, J. S., & Wigfield, A. (2020). From expectancy-value theory to situated expectancy-value theory: A developmental, social cognitive, and sociocultural perspective on motivation. Contemporary Educational Psychology, 61, 101859. https://doi.org/10.1016/j.cedpsych.2020.101859
Eid, M., Schneider, C., & Schwenkmezger, P. (1999). Do you feel better or worse? The validity of perceived deviations of mood states from mood traits. European Journal of Personality, 13(4), 283–306. https://doi.org/10.1002/(sici)1099-0984(199907/08)13:4%3C283::aid-per341%3E3.0.co;2-0
Eklöf, H. (2010). Skill and will: Test-taking motivation and assessment quality. Assessment in Education: Principles, Policy & Practice, 17(4), 345–356. https://doi.org/10.1080/0969594X.2010.516569
Elliott, E. S., & Dweck, C. S. (1988). Goals: An approach to motivation and achievement. Journal of Personality and Social Psychology, 54(1), 5–12. https://doi.org/10.1037/0022-3514.54.1.5
Elshorbagy, A., & Schoenwetter, D. (2002). Engineer morphing: Bridging the gap between classroom teaching and the engineering profession. International Journal of Engineering Education, 18(3), 295–300.
Emmioglu Sarikaya, E., Ok, A., Capa Aydin, Y., & Schau, C. (2018). Turkish version of the survey of attitudes toward statistics: Factorial structure invariance by gender. International Journal of Higher Education, 7(2), 121. https://doi.org/10.5430/ijhe.v7n2p121
Emmioglu, E., & Capa Aydin, Y. (2012). Attitudes and achievement in statistics: A meta-analysis study. Statistics Education Research Journal, 11(2), 95–102. https://doi.org/10.52041/serj.v11i2.332
Enders, C. K. (2010). Applied missing data analysis. The Guilford Press.
Enders, C. K. (Ed.). (2010). Methodology in the social sciences. Applied missing data analysis. Guilford Press. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=533872
Enders, C. K., & Baraldi, A. N. (2018). Missing data handling methods. In The Wiley handbook of psychometric testing (pp. 139–185). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118489772.ch6
Enders, N., Gaschler, R., & Kubik, V. (2021). Online quizzes with closed questions in formal assessment: How elaborate feedback can promote learning. Psychology Learning & Teaching, 20(1), 91–106. https://doi.org/10.1177/1475725720971205
Engelbrecht, J., Harding, A., & Du Preez, J. (2007). Long-term retention of basic mathematical knowledge and skills with engineering students. European Journal of Engineering Education, 32(6), 735–744. https://doi.org/10.1080/03043790701520792
Erzen, E. (2017). The effect of anxiety on student achievement. In E. Karadag (Ed.), The factors effecting student achievement (pp. 75–94). Springer International Publishing. https://doi.org/10.1007/978-3-319-56083-0_5
Evans, B., & Culp, R. (2015). Online quiz time limits and learning outcomes in economics. E-Journal of Business Education & Scholarship of Teaching, 9(1), 87–96.
Evans, C. (2013). Making sense of assessment feedback in higher education. Review of Educational Research, 83(1), 70–120. https://doi.org/10.3102/0034654312474350
Evans, D. J. R., Zeun, P., & Stanier, R. A. (2014). Motivating student learning using a formative assessment journey. Journal of Anatomy, 224(3), 296–303. https://doi.org/10.1111/joa.12117

Evans, T., Kensington-Miller, B., & Novak, J. (2021). Effectiveness, efficiency, engagement: Mapping the impact of pre-lecture quizzes on educational exchange. Australasian Journal of Educational Technology, 163–177. https://doi.org/10.14742/ajet.6258
Eysenck, M. W., & Calvo, M. G. (1992). Anxiety and performance: the processing efficiency theory. Cognition & Emotion, 6(6), 409–434. https://doi.org/10.1080/02699939208409696
Fairclough, D. L. (2010). Design and analysis of quality of life studies in clinical trials (2nd ed.). CRC Press.
Farmus, L., Cribbie, R. A., & Rotondi, M. A. (2020). The flipped classroom in introductory statistics: early evidence from a systematic review and meta-analysis. Journal of Statistics Education, 28(3), 316–325. https://doi.org/10.1080/10691898.2020.1834475
Ferguson, E., & Cox, T. (1993). Exploratory factor analysis: A users’ guide. International Journal of Selection and Assessment, 1(2), 84–94. https://doi.org/10.1111/j.1468-2389.1993.tb00092.x
Fidalgo-Blanco, A., Martinez-Nuñez, M., Borrás-Gene, O., & Sanchez-Medina, J. J. (2017). Micro flip teaching – An innovative model to promote the active involvement of students. Computers in Human Behavior, 72, 713–723. https://doi.org/10.1016/j.chb.2016.07.060
Fierro-Suero, S., Almagro, B. J., & Sáenz-López, P. (2020). Validation of the achievement emotions questionnaire for physical education (AEQ-PE). International Journal of Environmental Research and Public Health, 17(12). https://doi.org/10.3390/ijerph17124560
Finney, S. J., & DiStefano, C. (2006). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course (pp. 269–314). Information Age Publishing.
Finney, S. J., & Schraw, G. (2003). Self-efficacy beliefs in college statistics courses. Contemporary Educational Psychology, 28(2), 161–186. https://doi.org/10.1016/s0361-476x(02)00015-2
Fischer, F., Schult, J., & Hell, B. (2013). Sex differences in secondary school success: why female students perform better. European Journal of Psychology of
Fitzmaurice, G. M. (2004). Applied longitudinal analysis. Wiley series in probability and statistics. Wiley-Interscience. https://doi.org/10.1002/9781119513469
Flake, J. K., Barron, K. E., Hulleman, C., McCoach, B. D., & Welsh, M. E. (2015). Measuring cost: The forgotten component of expectancy-value theory. Contemporary Educational Psychology, 41, 232–244. https://doi.org/10.1016/j.cedpsych.2015.03.002
Fong, C. J., & Kremer, K. P. (2020). An expectancy-value approach to math underachievement: examining high school achievement, college attendance, and stem interest. Gifted Child Quarterly, 64(2), 67–84. https://doi.org/10.1177/0016986219890599
Fong, C. J., Williams, K. M., Williamson, Z. H., Lin, S., Kim, Y. W., & Schallert, D. L. (2018). “Inside out”: Appraisals for achievement emotions from constructive, positive, and negative feedback on writing. Motivation and Emotion, 42(2), 236–257. https://doi.org/10.1007/s11031-017-9658-y

Fornell, C., & Bookstein, F. L. (1982). Two structural equation models: LISREL and PLS applied to consumer exit-voice theory. Journal of Marketing Research, 19(4), 440–452. https://www.jstor.org/stable/3151718
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18, 39–50.
Förster, M., & Maur, A. (2015). Statistics anxiety and self-concept of beginning students in the social sciences—A matter of gender and socio-cultural background? Zeitschrift für Hochschulentwicklung (ZfHE), 10(4), 67–90.
Förster, M., & Maur, A. (2016, April). Analyzing Change in Students’ Statistics Self-Concept and Anxiety [Paper presentation]. Annual Meeting of the American Educational Research Association, Washington, DC.
Förster, M., Maur, A., & Bauer, T. (2022). Dropout in a blended learning and a traditional course—the role of course design and emotions in and outside the classroom. Manuscript submitted for publication.
Förster, M., Weiser, C., & Maur, A. (2018). How feedback provided by voluntary electronic quizzes affects learning outcomes of university students in large classes. Computers & Education, 121, 100–114.
Franceschini, G., Galli, S., Chiesi, F., & Primi, C. (2014). Implicit gender-math stereotype and women’s susceptibility to stereotype threat as a stereotype lift. Learning and Individual Differences, 32, 273–277.
Fredricks, J. A., & McColskey, W. (2012). The measurement of student engagement: a comparative analysis of various methods and student self-report instruments. In S. L. Christenson, A. L. Reschly, & C. Wylie (Eds.), Handbook of research on student engagement (pp. 763–782). Springer US. https://doi.org/10.1007/978-1-4614-2018-7_37
Frenzel, A. C., Pekrun, R., & Goetz, T. (2007). Girls and mathematics—A ‘hopeless’ issue? A control-value approach to gender differences in emotions towards mathematics. European Journal of Psychology of Education, 22(4), 497–514.
Frenzel, A. C., Thrash, T. M., Pekrun, R., & Goetz, T. (2007). Achievement emotions in Germany and China: A cross-cultural validation of the Academic Emotions Questionnaire-Mathematics. Journal of Cross-Cultural Psychology, 38(3), 302–309.
Frenzel, A. C., Thrash, T. M., Pekrun, R., & Goetz, T. (2007). Achievement emotions in Germany and China. Journal of Cross-Cultural Psychology, 38(3), 302–309. https://doi.org/10.1177/0022022107300276
Fulton, R., & Fulton, D. (2020). A simulation, persistence, engagement, and feedback impact performance in a computer networking course. Developments in Business Simulation and Experiential Learning, 47, 77–89.
Fyfe, E. R., Rittle-Johnson, B., & DeCaro, M. S. (2012). The effects of feedback during exploratory mathematics problem solving: Prior knowledge matters. Journal of Educational Psychology, 104(4), 1094–1108. https://doi.org/10.1037/a0028389
Gäde, J., Schermelleh-Engel, K., & Brandt, H. (2020). Konfirmatorische Faktorenanalyse (CFA). In H. Moosbrugger & A. Kelava (Eds.), Lehrbuch. Testtheorie und Fragebogenkonstruktion (3rd ed., pp. 615–660). Springer.
Gal, I., & Ginsburg, L. (1994). The role of beliefs and attitudes in learning statistics: Towards an assessment framework. Journal of Statistics Education, 2(2). https://doi.org/10.1080/10691898.1994.11910471

Gangire, Y., Da Veiga, A., & Herselman, M. (2020). Information security behavior: Development of a measurement instrument based on the self-determination theory. In N. Clarke & S. Furnell (Eds.), Human aspects of information security and assurance: 14th IFIP WG 11.12 International Symposium, HAISA 2020, Mytilene, Lesbos, Greece, July 8–10, 2020: Proceedings (pp. 144–160). Springer.
Gannaway, D., Green, T., & Mertova, P. (2017). So how big is big? Investigating the impact of class size on ratings in student evaluation. Assessment & Evaluation in Higher Education, 43, 175–184. https://doi.org/10.1080/02602938.2017.1317327
Garfield, J., & Ben-Zvi, D. (2007). How Students Learn Statistics Revisited: A Current Review of Research on Teaching and Learning Statistics. International Statistical Review, 75(3), 372–396.
Garfield, J., Le, L., Zieffler, A., & Ben-Zvi, D. (2014). Developing students’ reasoning about samples and sampling variability as a path to expert statistical thinking. Educational Studies in Mathematics, 88(3), 327–342. https://doi.org/10.1007/s10649-014-9541-7
Garfield, J. (1995). How students learn statistics. International Statistical Review / Revue Internationale De Statistique, 63(1), 25. https://doi.org/10.2307/1403775
Garfield, J. (2002). The challenge of developing statistical reasoning. Journal of Statistics Education, 10(3). https://doi.org/10.1080/10691898.2002.11910676
Garfield, J., & Ahlgren, A. (1988). Difficulties in learning basic concepts in probability and statistics: Implications for research. Journal for Research in Mathematics Education, 19(1), 44–63. https://doi.org/10.5951/jresematheduc.19.1.0044
Garfield, J., & Chance, B. (2000). Assessment in statistics education: Issues and challenges. Mathematical Thinking and Learning, 2(1–2), 99–125. https://doi.org/10.1207/s15327833mtl0202_5
Garrison, D. R., & Kanuka, H. (2004). Blended learning: uncovering its transformative potential in higher education. Internet and Higher Education, 7(2), 95–105. https://doi.org/10.1016/j.iheduc.2004.02.001
Gaspard, H., Dicke, A.-L., Flunger, B., Brisson, B. M., Häfner, I., Nagengast, B., & Trautwein, U. (2015). Fostering adolescents’ value beliefs for mathematics with a relevance intervention in the classroom. Developmental Psychology, 51(9), 1226–1240. https://doi.org/10.1037/dev0000028
Geiser, C. (2013). Data Analysis with Mplus. Guilford Publications.
Geiser, C. (2010). Datenanalyse mit Mplus: Eine anwendungsorientierte Einführung. VS Verlag für Sozialwissenschaften.
Geiser, C. (2020). Longitudinal structural equation modeling with Mplus: A latent state-trait perspective. Methodology in the social sciences. The Guilford Press.
Geiser, C. (2020). Longitudinal structural equation modeling with Mplus: A latent state-trait perspective. Guilford Publications.
Geiser, C., & Lockhart, G. (2012). A comparison of four approaches to account for method effects in latent state-trait analyses. Psychological Methods, 17(2), 255–283. https://doi.org/10.1037/a0026977
Geiser, C., Bishop, J., & Lockhart, G. (2015). Collapsing factors in multitrait-multimethod models: Examining consequences of a mismatch between measurement design and model. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00946
Giannakos, M. N., Chorianopoulos, K., & Chrisochoides, N. (2015). Making sense of video analytics: Lessons learned from clickstream interactions, attitudes, and learning outcome
in a video-assisted course. The International Review of Research in Open and Distributed Learning, 16(1). https://doi.org/10.19173/irrodl.v16i1.1976
Gikandi, J. W., Morrow, D., & Davis, N. E. (2011). Online formative assessment in higher education: A review of the literature. Computers & Education, 57(4), 2333–2351.
Gilboy, M. B., Heinerichs, S., & Pazzaglia, G. (2015). Enhancing student engagement using the flipped classroom. Journal of Nutrition Education and Behavior, 47(1), 109–114. https://doi.org/10.1016/j.jneb.2014.08.008
Gist, M. E. (1987). Self-efficacy: Implications for organizational behavior and human resource management. Academy of Management Review, 12(3), 472–485. https://doi.org/10.5465/amr.1987.4306562
Gist, M. E., & Mitchell, T. R. (1992). Self-Efficacy: A theoretical analysis of its determinants and malleability. Academy of Management Review, 17(2), 183–211. https://doi.org/10.5465/amr.1992.4279530
Goetz, T., Bieg, M., Lüdtke, O., Pekrun, R., & Hall, N. C. (2013). Do girls really experience more anxiety in mathematics? Psychological Science, 24(10), 2079–2087. https://doi.org/10.1177/0956797613486989
Goetz, T., Lipnevich, A. A., Krannich, M., & Gogol, K. (2018). Performance feedback and emotions. In A. A. Lipnevich & J. K. Smith (Eds.), The Cambridge handbook of instructional feedback (pp. 554–575). Cambridge University Press.
Goetz, T., Nett, U. E., Martiny, S. E., Hall, N. C., Pekrun, R., Dettmers, S., & Trautwein, U. (2012). Students’ emotions during homework: Structures, self-concept antecedents, and achievement outcomes. Learning and Individual Differences, 22(2), 225–234. https://doi.org/10.1016/j.lindif.2011.04.006
Goetz, T., Preckel, F., Pekrun, R., & Hall, N. C. (2007). Emotional experiences during test taking: Does cognitive ability make a difference? Learning and Individual Differences, 17(1), 3–16. https://doi.org/10.1016/j.lindif.2006.12.002
Goetz, T., Sticca, F., Pekrun, R., Murayama, K., & Elliot, A. J. (2016). Intraindividual relations between achievement goals and discrete achievement emotions: An experience sampling approach. Learning and Instruction, 41, 115–125. https://doi.org/10.1016/j.learninstruc.2015.10.007
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative regression imputation, and expectation-maximization. Structural Equation Modeling, 7, 319–355. https://doi.org/10.1207/S15328007SEM0703_1
Goldman, A. D., & Penner, A. M. (2016). Exploring international gender differences in mathematics self-concept. International Journal of Adolescence and Youth, 21(4), 403–418.
Gómez, O., García-Cabrero, B., Hoover, M. L., Castañeda-Figueiras, S., & Guevara Benítez, Y. (2020). Achievement emotions in mathematics: Design and evidence of validity of a self-report scale. Journal of Education and Learning, 9(5), 1–15.
Gómez, O., García-Cabrero, B., Hoover, M. L., Castañeda-Figueiras, S., & Benítez, Y. G. (2020). Achievement emotions in mathematics: design and evidence of validity of a self-report scale. Journal of Education and Learning, 9(5), 233. https://doi.org/10.5539/jel.v9n5p233
González, A., Rodríguez, Y., Faílde, J. M., & Carrera, M. V. (2016). Anxiety in the statistics class: Structural relations with self-concept, intrinsic value, and engagement in two samples of undergraduates. Learning and Individual Differences, 45, 214–221. https://doi.org/10.1016/j.lindif.2015.12.019

Graham, C. (2006). Blended learning systems: definition, current trends, and future directions. In C. J. Bonk & C. R. Graham (Eds.), Handbook of blended learning: Global perspectives, local designs. Pfeiffer.
Griffin, B. W. (2016). Perceived autonomy support, intrinsic motivation, and student ratings of instruction. Studies in Educational Evaluation, 51, 116–125. https://doi.org/10.1016/j.stueduc.2016.10.007
Groß, L., Boger, M., Hamann, S., & Wedjelek, M. (2012). ZEITlast—Lehrzeit und Lernzeit: Studierbarkeit der BA-/BSc-und MA/MSc-Studiengänge als Adaption von Lehrorganisation und Zeitmanagement unter Berücksichtigung von Fächerkultur und neuen Technologien [= Studyability of the BA/BSc and MA/MSc study programmes as an adaptation of teaching organisation and time management taking into account subject culture and new technologies]. https://www.blogs.uni-mainz.de/medienpaedagogik/files/2014/03/Abschlussbericht_ZEITLast.pdf Accessed 07 April 2020.
Gruber, H., & Mohe, M. (2012). Professional knowledge is (also) knowledge about errors. In J. Bauer & C. Harteis (Eds.), Human Fallibility: The Ambiguity of Errors for Work and Learning (pp. 71–90). Springer.
Grund, A., & Fries, S. (2018). Understanding procrastination: A motivational approach. Personality and Individual Differences, 121(15), 120–130. https://doi.org/10.1016/j.paid.2017.09.035
Gundlach, E., Richards, K. A. R., Nelson, D., & Levesque-Bristol, C. (2015). A comparison of student attitudes, statistical reasoning, performance, and perceptions for web-augmented traditional, fully online, and flipped sections of a statistical literacy class. Journal of Statistics Education, 23(1). https://doi.org/10.1080/10691898.2015.11889723
Guo, J., Marsh, H. W., Parker, P. D., Morin, A. J., & Dicke, T. (2017). Extending expectancy-value theory predictions of achievement and aspirations in science: Dimensional comparison processes and expectancy-by-value interactions. Learning and Instruction, 49, 81–91. https://doi.org/10.1016/j.learninstruc.2016.12.007
Guo, J., Marsh, H. W., Parker, P. D., Morin, A. J., & Yeung, A. S. (2015). Expectancy-value in mathematics, gender and socioeconomic background as predictors of achievement and aspirations: A multi-cohort study. Learning and Individual Differences, 37, 161–168. https://doi.org/10.1016/j.lindif.2015.01.008
Guo, J., Nagengast, B., Marsh, H. W., Kelava, A., Gaspard, H., Brandt, H., Cambria, J., Flunger, B., Dicke, A.-L., Häfner, I., Brisson, B., & Trautwein, U. (2016). Probing the unique contributions of self-concept, task values, and their interactions using multiple value facets and multiple academic outcomes. AERA Open, 2(1). https://doi.org/10.1177/2332858415626884
Guo, J., Parker, P. D., Marsh, H. W., & Morin, A. J. S. (2015). Achievement, motivation, and educational choices: A longitudinal study of expectancy and value using a multiplicative perspective. Developmental Psychology, 51(8), 1163–1176. https://doi.org/10.1037/a0039440
Guo, Y. R., & Goh, D. H.-L. (2016). Evaluation of affective embodied agents in an information literacy game. Computers & Education, 103, 59–75. https://doi.org/10.1016/j.compedu.2016.09.013
Hacker, D. J. (Ed.). (2009). The educational psychology series. Handbook of metacognition in education (1. publ). Routledge.

Hadie, S., Simok, A., Shamsuddin, S., & Mohammad, S. (2019). Determining the impact of pre-lecture educational video on comprehension of a difficult gross anatomy lecture. Journal of Taibah University Medical Sciences, 14(4), 395–401.
Hagger-Johnson, G., Batty, G. D., Deary, I. J., & von Stumm, S. (2011). Childhood socioeconomic status and adult health: Comparing formative and reflective models in the Aberdeen Children of the 1950s Study (prospective cohort study). Journal of Epidemiology & Community Health, 65(11), 1024–1029. https://doi.org/10.1136/jech.2010.127696
Hair, J. F., Ringle, C. M., & Sarstedt, M. (2011). PLS-SEM: Indeed a silver bullet. The Journal of Marketing Theory and Practice, 19(2), 139–151. http://dx.doi.org/10.2753/MTP1069-6679190202
Hair, J., Sarstedt, M., Hopkins, L., & Kuppelwieser, V. (2014). Partial least squares structural equation modeling (PLS-SEM): An emerging tool for business research. European Business Review, 26(2), 106–121. http://dx.doi.org/10.1108/EBR-10-2013-0128
Hammad, S., Graham, T., Dimitriadis, C., & Taylor, A. (2022). Effects of a successful mathematics classroom framework on students’ mathematics self-efficacy, motivation, and achievement: a case study with freshmen students at a university foundation programme in Kuwait. International Journal of Mathematical Education in Science and Technology, 53(6), 1502–1527. https://doi.org/10.1080/0020739X.2020.1831091
Hancock, T. E., Thurman, R. A., & Hubbard, D. C. (1995). An expanded control model for the use of instructional feedback. Contemporary Educational Psychology, 20(4), 410–425. https://doi.org/10.1006/ceps.1995.1028
Händel, M., Artelt, C., & Weinert, S. (2013). Assessing metacognitive knowledge: development and evaluation of a test instrument. Journal for Educational Research Online, 5, 162–188.
Handke, J. (2015). Shift Learning Activities—vom Inverted Classroom Mastery Model zum xMOOC. In N. Nistor & S. Schirlitz (Eds.), Digitale Medien und Interdisziplinarität (pp. 113–123). Waxmann.
Handke, J. (2020). Von der klassischen Vorlesung zur Digitalen Integration. In Lob der Vorlesung (pp. 227–245). Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-29049-8_10
Hanna, D., Shevlin, M., & Dempster, M. (2008). The structure of the statistics anxiety rating scale: A confirmatory factor analysis using UK psychology students. Personality and Individual Differences, 45(1), 68–74. https://doi.org/10.1016/j.paid.2008.02.021
Hannigan, A., Hegarty, A. C., & McGrath, D. (2014). Attitudes towards statistics of graduate entry medical students: The role of prior learning experiences. BMC Medical Education, 14(1). https://doi.org/10.1186/1472-6920-14-70
Happ, R., & Zlatkin-Troitschanskaia, O. (2015). Vergleichende Analysen zur Heterogenität der Studierenden in wirtschaftswissenschaftlichen Studiengängen—kritische Implikationen für die Evaluation in Studium und Lehre [= Comparative Analyses of the Heterogeneity of Students in Economic Studies—Critical Implications for Evaluation in Studies and Teaching]. In S. Harris-Hümmert, L. Mitterauer, & P. Pohlenz (Eds.), Heterogenität der Studierenden: Herausforderung für die Qualitätsentwicklung in Studium und Lehre, neuer Fokus für die Evaluation? [= Heterogeneity of students: Challenge for quality development in studies and teaching, new focus for evaluation?] (pp. 149–165). UVW.
Harackiewicz, J. M., Canning, E. A., Tibbetts, Y., Priniski, S. J., & Hyde, J. S. (2016). Closing achievement gaps with a utility-value intervention: Disentangling race and social
class. Journal of Personality and Social Psychology, 111(5), 745–765. https://doi.org/10.1037/pspp0000075
Harackiewicz, J. M., Rozek, C. S., Hulleman, C. S., & Hyde, J. S. (2012). Helping parents to motivate adolescents in mathematics and science: An experimental test of a utility-value intervention. Psychological Science, 23(8), 899–906. https://doi.org/10.1177/0956797611435530
Harackiewicz, J. M., Smith, J. L., & Priniski, S. J. (2016). Interest matters: The importance of promoting interest in education. Policy Insights from the Behavioral and Brain Sciences, 3(2), 220–227. https://doi.org/10.1177/2372732216655542
Haraldseid, C., Friberg, F., & Aase, K. (2015). Nursing students’ perceptions of factors influencing their learning environment in a clinical skills laboratory: a qualitative study. Nurse Education Today, 35(9). https://doi.org/10.1016/j.nedt.2015.03.015
Harley, J. M., Lajoie, S. P., Frasson, C., & Hall, N. C. (2017). Developing emotion-aware, advanced learning technologies: a taxonomy of approaches and features. International Journal of Artificial Intelligence in Education, 27(2), 268–297. https://doi.org/10.1007/s40593-016-0126-8
Harley, J. M., Lou, N. M., Liu, Y., Cutumisu, M., Daniels, L. M., Leighton, J. P., & Nadon, L. (2021). University students’ negative emotions in a computer-based examination: the roles of trait test-emotion, prior test-taking methods and gender. Assessment & Evaluation in Higher Education, 46(6), 956–972. https://doi.org/10.1080/02602938.2020.1836123
Harpe, S. E., Phipps, L. B., & Alowayesh, M. S. (2012). Effects of a learning-centered approach to assessment on students’ attitudes towards and knowledge of statistics. Currents in Pharmacy Teaching and Learning, 4(4), 247–255. https://doi.org/10.1016/j.cptl.2012.05.002
Hattie, J. (2008). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.
Hattie, J., & Timperley, H. (2007). The Power of Feedback. Review of Educational Research, 77(1), 81–112.
Hattie, J., & Clarke, S. (2019). Visible Learning: Feedback. Routledge Taylor & Francis Group. https://doi.org/10.4324/9780429485480
Hattie, J., & Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement (Reprinted.). Routledge.
He, W., Holton, A., Farkas, G., & Warschauer, M. (2016). The effects of flipped instruction on out-of-class study time, exam performance, and student perceptions. Learning and Instruction, 45, 61–71. https://doi.org/10.1016/j.learninstruc.2016.07.001
Helmke, A., Helmke, T., Heyne, N., Hosenfeld, A., Kleinbub, I., Schrader, F.-W., & Wagner, W. (2007). Erfassung, Bewertung und Verbesserung des Grundschulunterrichts: Forschungsstand, Probleme und Perspektiven. In Qualität von Grundschulunterricht (pp. 17–34). VS Verlag für Sozialwissenschaften. https://doi.org/10.1007/978-3-531-90755-0_2
Hembree, R. (1988). Correlates, causes, effects, and treatment of test anxiety. Review of Educational Research, 58(1), 47–77. https://doi.org/10.3102/00346543058001047
Hembree, R. (1990). The nature, effects, and relief of mathematics anxiety. Journal for Research in Mathematics Education, 21(1), 33–46. https://doi.org/10.5951/jresematheduc.21.1.0033

Henrie, C. R., Halverson, L. R., & Graham, C. R. (2015). Measuring student engagement in technology-mediated learning: A review. Computers & Education, 90, 36–53. https://doi.org/10.1016/j.compedu.2015.09.005
Herman, J. (2020). Student attitudes in a real-world inspired second statistics course. In H. Marchionda & S. Bateiha (Eds.), Proceedings of the 48th Annual Meeting of the Research Council on Mathematics Learning. Virtual. https://www.rcml-math.org/assets/Proceedings/RCML%202021%20Proceedings%2022221.pdf
Hew, K. F., & Lo, C. K. (2018). Flipped classroom improves student learning in health professions education: A meta-analysis. BMC Medical Education, 18(1), 38. https://doi.org/10.1186/s12909-018-1144-z
Hilton, S. C., Schau, C., & Olsen, J. A. (2004). Survey of attitudes toward statistics: Factor structure invariance by gender and by administration time. Structural Equation Modeling: A Multidisciplinary Journal, 11(1), 92–109. https://doi.org/10.1207/s15328007sem1101_7
Hiltz, S., & Shea, P. (2005). The student in the online classroom. In S. Hiltz & R. Goldman (Eds.), Learning Together Online: Research on Asynchronous Learning Networks (pp. 137–160). Lawrence Erlbaum Associates.
Hirsch, L. S., & O’Donnell, A. M. (2001). Representativeness in statistical reasoning: Identifying and assessing misconceptions. Journal of Statistics Education, 9(2).
Höffler, T. N. (2010). Spatial ability: its influence on learning with visualizations—a meta-analytic review. Educational Psychology Review, 22(3), 245–269. https://doi.org/10.1007/s10648-010-9126-7
Homburg, C., & Giering, A. (1996). Konzeptualisierung und Operationalisierung komplexer Konstrukte. Marketing ZFP—Journal of Research and Management, 18(1), 5–24.
Hommik, C., & Luik, P. (2017). Adapting the survey of attitudes towards statistics (SATS-36) for Estonian secondary school students. Statistics Education Research Journal, 16(1), 228–239. https://doi.org/10.52041/serj.v16i1.229
Hood, S., Barrickman, N., Djerdjian, N., Farr, M., Magner, S., Roychowdhury, H., Gerrits, R., Lawford, H., Ott, B., Ross, K., Paige, O., Stowe, S., Jensen, M., & Hull, K. (2021). “I Like and Prefer to Work Alone”: Social anxiety, academic self-efficacy, and students’ perceptions of active learning. CBE Life Sciences Education, 20(1), ar12. https://doi.org/10.1187/cbe.19-12-0271
Hooshyar, D., Ahmad, R. B., Yousefi, M., Fathi, M., Horng, S.-J., & Lim, H. (2016). Applying an online game-based formative assessment in a flowchart-based intelligent tutoring system for improving problem-solving skills. Computers & Education, 94, 18–36. https://doi.org/10.1016/j.compedu.2015.10.013
Horz, H. (2015). Medien. In E. Wild & J. Möller (Eds.), Pädagogische Psychologie (2nd ed., pp. 121–147). Springer.
Hoskins, S. L., & van Hooff, J. C. (2005). Motivation and ability: which students use online learning and what influence does it have on their achievement? British Journal of Educational Technology, 36(2), 177–192.
Howell, A. J., & Watson, D. C. (2007). Procrastination: Associations with achievement goal orientation and learning strategies. Personality and Individual Differences, 43(1), 167–178. https://doi.org/10.1016/j.paid.2006.11.017
Hoyle, R. H. (2011). Structural equation modeling for social and personality psychology. SAGE.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling: A Multidisciplinary Journal, 6(1), 1–55. https://doi.org/10.1080/10705519909540118
Huang, B., Hew, K. F., & Lo, C. K. (2018). Investigating the effects of gamification-enhanced flipped learning on undergraduate students’ behavioral and cognitive engagement. Interactive Learning Environments, 27(8), 1106–1126. https://doi.org/10.1080/10494820.2018.1495653
Huber, M., & Krause, S. (Eds.). (2018). Bildung und Emotion. Springer VS.
Huberty, C. J., Dresden, J., & Bak, B.-G. (1993). Relations among dimensions of statistical knowledge. Educational and Psychological Measurement, 53(2), 523–532. https://doi.org/10.1177/0013164493053002022
Hulleman, C. S., & Barron, K. E. (2016). Motivation interventions in education: Bridging theory, research, and practice. In L. Corno & E. M. Anderman (Eds.), Handbook of educational psychology (pp. 160–171). Routledge.
Hulleman, C. S., Godes, O., Hendricks, B. L., & Harackiewicz, J. M. (2010). Enhancing interest and performance with a utility value intervention. Journal of Educational Psychology, 102(4), 880–895. https://doi.org/10.1037/a0019506
Hurley, T. (2006). Intervention strategies to increase self-efficacy and self-regulation in adaptive on-line learning. In Lecture notes in computer science (pp. 440–444). Springer Berlin Heidelberg. https://doi.org/10.1007/11768012_66
Huxley, G., Mayo, J., Peacey, M., & Richardson, M. (2018). Class size at University. Fiscal Studies, 39(2), 241–264. https://doi.org/10.1111/j.1475-5890.2017.12149
Iossi, L. (2007). Strategies for reducing math anxiety in post-secondary students. In S. M. Nielsen & M. S. Plakhotnik (Eds.), Proceedings of the Sixth Annual College of Education Research Conference: Urban and International Education Section (pp. 30–35). Florida International University.
Ismail, N. M. (2015). EFL Saudi students’ class emotions and their contributions to their English achievement at Taif University. International Journal of Psychological Studies, 7(4), 19. https://doi.org/10.5539/ijps.v7n4p19
Jacob, B., Hofmann, F., Stephan, M., Fuchs, K., Markus, S., & Gläser-Zikuda, M. (2019). Students’ achievement emotions in university courses—does the teaching approach matter? Studies in Higher Education, 44(10), 1768–1780. https://doi.org/10.1080/03075079.2019.1665324
Jain, S., & Dowson, M. (2009). Mathematics anxiety as a function of multidimensional self-regulation and self-efficacy. Contemporary Educational Psychology, 34(3), 240–249. https://doi.org/10.1016/j.cedpsych.2009.05.004
Jang, H., Reeve, J., & Halusic, M. (2016). A new autonomy-supportive way of teaching that increases conceptual learning: teaching in students’ preferred ways. The Journal of Experimental Education, 84(4), 686–701. https://doi.org/10.1080/00220973.2015.1083522
Jarrell, A., Harley, J. M., Lajoie, S., & Naismith, L. (2017). Success, failure and emotions: examining the relationship between performance feedback and emotions in diagnostic reasoning. Educational Technology Research and Development, 65(5), 1263–1284. https://doi.org/10.1007/s11423-017-9521-6

Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and measurement model misspecification in marketing and consumer research. Journal of Consumer Research, 30(2), 199–218. https://doi.org/10.1086/376806
Jdaitawi, M. (2020). Does flipped learning promote positive emotions in science education? A comparison between traditional and flipped classroom approaches. Electronic Journal of E-Learning, 18(6). https://doi.org/10.34190/JEL.18.6.004
Jeong, J. S., González-Gómez, D., & Cañada-Cañada, F. (2016). Students’ perceptions and emotions toward learning in a flipped general science classroom. Journal of Science Education and Technology, 25(5), 747–758. https://doi.org/10.1007/s10956-016-9630-8
Jia, J., Chen, Y., Ding, Z., Bai, Y., Yang, B., Lit, M., & Qit, J. (2013). Effects of an intelligent web-based English instruction system on students’ academic performance. Journal of Computer Assisted Learning, 29, 556–568.
Jiang, Y., Rosenzweig, E. Q., & Gaspard, H. (2018). An expectancy-value-cost approach in predicting adolescent students’ academic motivation and achievement. Contemporary Educational Psychology, 54, 139–152. https://doi.org/10.1016/j.cedpsych.2018.06.005
Jones, G. A., Langrall, C. W., Thornton, C. A., Mooney, E. S., Wares, A., Jones, M. R., Perry, B., Putt, I. J., & Nisbet, S. (2001). Using students’ statistical thinking to inform instruction. The Journal of Mathematical Behavior, 20(1), 109–144. https://doi.org/10.1016/s0732-3123(01)00064-5
Kalyuga, S. (2007). Expertise reversal effect and its implications for learner-tailored instruction. Educational Psychology Review, 19, 509–539.
Kalyuga, S., & Sweller, J. (2004). Measuring knowledge to optimize cognitive load factors during instruction. Journal of Educational Psychology, 96(3), 558–568.
Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational Psychologist, 38(1), 23–31.
Kang, E., & Han, Z. (2015). The efficacy of written corrective feedback in improving L2 written accuracy: A meta-analysis. The Modern Language Journal, 99(1), 1–18. https://doi.org/10.1111/modl.12189
Karadag, E. (Ed.). (2017). The factors effecting student achievement. Springer International Publishing. https://doi.org/10.1007/978-3-319-56083-0
Karaman, P. (2021). The effect of formative assessment practices on student learning: a meta-analysis study. International Journal of Assessment Tools in Education, 801–817. https://doi.org/10.21449/ijate.870300
Kauermann, G. (2015). Anwendungsorientierte Studiengänge und Fächerkombinationen—Statistik. In Springer Spektrum (Ed.), Studien- und Berufsplaner Mathematik. Schlüsselqualifikationen für Technik, Wirtschaft und IT (5th ed., pp. 120–123). Springer Spektrum.
Kelley, C. M., & McLaughlin, A. C. (2009). Feedback specificity requirements for learning in younger and older adults: The role of cognitive resources and task demand. Human Factors and Ergonomics Society Annual Meeting Proceedings, 53(22), 1699–1703. https://doi.org/10.1518/107118109X12524444081511


Kerby, A. T., & Wroughton, J. R. (2017). When do students’ attitudes change? Investigating student attitudes at midterm. Statistics Education Research Journal, 16(2), 476–486. https://doi.org/10.52041/serj.v16i2.202
Kerby, A. T., & Wroughton, J. R. (2021). When do students’ attitudes change? Investigating student attitudes at midterm. Statistics Education Research Journal, 16(2), 476–486. https://doi.org/10.52041/serj.v16i2.202
Ketonen, E. E., Dietrich, J., Moeller, J., Salmela-Aro, K., & Lonka, K. (2018). The role of daily autonomous and controlled educational goals in students’ academic emotion states: An experience sampling method approach. Learning and Instruction, 53, 10–20. https://doi.org/10.1016/j.learninstruc.2017.07.003
Khanna, M. M. (2015). Ungraded pop quizzes. Teaching of Psychology, 42(2), 174–178. https://doi.org/10.1177/0098628315573144
Khavenson, T., Orel, E., & Tryakshina, M. (2012). Adaptation of survey of attitudes towards statistics (SATS 36) for Russian sample. Procedia – Social and Behavioral Sciences, 46, 2126–2129. https://doi.org/10.1016/j.sbspro.2012.05.440
Kher, H. V., Downey, J. P., & Monk, E. (2013). A longitudinal examination of computer self-efficacy change trajectories during training. Computers in Human Behavior, 29(4), 1816–1824. https://doi.org/10.1016/j.chb.2013.02.022
Kibble, J. (2007). Use of unsupervised online quizzes as formative assessment in a medical physiology course: Effects of incentives on student participation and performance. Advances in Physiology Education, 31(3), 253–260. https://doi.org/10.1152/advan.00027.2007
Kibble, J. D. (2017). Best practices in summative assessment. Advances in Physiology Education, 41(1), 110–119. https://doi.org/10.1152/advan.00116.2016
Kiekkas, P., Panagiotarou, A., Malja, A., Tahirai, D., Zykai, R., Bakalis, N., & Stefanopoulos, N. (2015). Nursing students’ attitudes toward statistics: Effect of a biostatistics course and association with examination performance. Nurse Education Today, 35(12), 1283–1288. https://doi.org/10.1016/j.nedt.2015.07.005
Kim, K. R., & Seo, E. H. (2015). The relationship between procrastination and academic performance: A meta-analysis. Personality and Individual Differences, 82, 26–33. https://doi.org/10.1016/j.paid.2015.02.038
Kim, M., Kim, S., Khera, O., & Getman, J. (2014). The experience of three flipped classrooms in an urban university: an exploration of design principles. The Internet and Higher Education, 22, 37–50. https://doi.org/10.1016/j.iheduc.2014.04.003
Kingston, N., & Nash, B. (2011). Formative assessment: A meta-analysis and a call for research. Educational Measurement: Issues and Practice, 30(4), 28–37. https://doi.org/10.1111/j.1745-3992.2011.00220.x
Kleftodimos, A., & Evangelidis, G. (2015). An interactive video-based learning environment supporting learning analytics: Insights obtained from analyzing learner activity data. In Y. Li, M. Chang, M. Kravcik, E. Popescu, R. Huang et al. (Eds.), State-of-the-art and future directions of smart learning (pp. 471–481). Springer.
Kleftodimos, A., & Evangelidis, G. (2016). An interactive video-based learning environment supporting learning analytics: insights obtained from analyzing learner activity data. In Y. Li, M. Chang, M. Kravcik, E. Popescu, R. Huang, Kinshuk, & N.-S. Chen (Eds.), Lecture Notes in Educational Technology. State-of-the-art and future directions of smart learning (pp. 471–481). Springer Singapore. https://doi.org/10.1007/978-981-287-868-7_56


Kleij, F. M., Eggen, T., Timmers, C. F., & Veldkamp, B. P. (2012). Effects of feedback in a computer-based assessment for learning. Computers & Education, 58, 263–272.
Kleij, F. M., Feskens, R. C. W., & Eggen, T. J. H. M. (2015). Effects of feedback in a computer-based learning environment on students’ learning outcomes. Review of Educational Research, 85(4), 475–511. https://doi.org/10.3102/0034654314564881
Klein, A., & Moosbrugger, H. (2000). Maximum likelihood estimation of latent interaction effects with the LMS method. Psychometrika, 65(4), 457–474. https://doi.org/10.1007/bf02296338
Kleinke, K., Schlüter, E., & Christ, O. (2017). Strukturgleichungsmodelle mit Mplus: Eine praktische Einführung. De Gruyter. https://ebookcentral.proquest.com/lib/kxp/detail.action?docID=4749512
Kline, R. B. (2015). Principles and practice of structural equation modeling, fourth edition. Guilford Publications.
Klinke, S., Härdle, W. K., & Rönz, B. (2018). Introduction to statistics: Using interactive mm*stat elements. Springer.
Klopp, E., & Klößner, S. (2020, December 9). Failure to Detect Metric Measurement Non-Invariance: How Manifest Residual Variances, Indicator Communalities, and Sample Size affect the Chi2-Test Statistic. https://psyarxiv.com/jkxg8/
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: A historical review, a meta-analysis, and a preliminary feedback intervention theory. Psychological Bulletin, 119(2), 254–284.
KMK Bonn and Berlin (Eds.). (2015). Bildungsstandards im Fach Mathematik für die Allgemeine Hochschulreife: (Beschluss der Kultusministerkonferenz vom 18.1.2012). Wolters Kluwer Deutschland GmbH.
Knekta, E., Runyon, C., & Eddy, S. (2019). One size doesn’t fit all: Using factor analysis to gather validity evidence when using surveys in your research. CBE Life Sciences Education, 18(1), rm1, 1–17. https://doi.org/10.1187/cbe.18-04-0064
Koenker, R. (2005). Quantile Regression. Cambridge University Press.
Kolenikov, S., & Bollen, K. A. (2012). Testing negative error variances. Sociological Methods & Research, 41(1), 124–167. https://doi.org/10.1177/0049124112442138
Korman, A. K. (1970). Toward an hypothesis of work behavior. Journal of Applied Psychology, 54(1, Pt.1), 31–41. https://doi.org/10.1037/h0028656
Kosovich, J. J., Hulleman, C. S., Barron, K. E., & Getty, S. (2014). A practical measure of student motivation. The Journal of Early Adolescence, 35(5–6), 790–816. https://doi.org/10.1177/0272431614556890
Krafft, M., Götz, O., & Liehr-Gobbers, K. (2005). Die Validierung von Strukturgleichungsmodellen mit Hilfe des Partial-Least-Squares (PLS)-Ansatzes. In F. Bliemel, A. Eggert, G. Fassot, & J. Henseler (Eds.), Handbuch PLS-Pfadmodellierung: Methode, Anwendung, Praxisbeispiele (pp. 71–116). Schäffer-Poeschel.
Krapp, A. (1995). Interesse, Lernen und Leistung: Neue Forschungsansätze in der Pädagogischen Psychologie. Zeitschrift für Pädagogik, 38(5), 747–770.
Krause, U.-M., Stark, R., & Mandl, H. (2009). The effects of cooperative learning and feedback on e-learning in statistics. Learning and Instruction, 19(2), 158–170. https://doi.org/10.1016/j.learninstruc.2008.03.003


Kulhavy, R. W., & Stock, W. A. (1989). Feedback in written instruction: The place of response certitude. Educational Psychology Review, 1(4), 279–308.
La Fuente, J. de, Lahortiga-Ramos, F., Laspra-Solís, C., Maestro-Martín, C., Alustiza, I., Aubá, E., & Martín-Lanas, R. (2020). A structural equation model of achievement emotions, coping strategies and engagement-burnout in undergraduate students: A possible underlying mechanism in facets of perfectionism. International Journal of Environmental Research and Public Health, 17(6). https://doi.org/10.3390/ijerph17062106
Lai, C.-L., & Hwang, G.-J. (2016). A self-regulated flipped classroom approach to improving students’ learning performance in a mathematics course. Computers & Education, 100, 126–140. https://doi.org/10.1016/j.compedu.2016.05.006
Laird, N. M. (1988). Missing data in longitudinal studies. Statistics in Medicine, 7(1–2), 305–315. https://doi.org/10.1002/sim.4780070131
Lajoie, S. (2014). Multimedia learning of cognitive processes. In R. Mayer (Ed.), The Cambridge Handbook of Multimedia Learning (2nd ed., pp. 623–646). Cambridge UP.
Lam, C. F., DeRue, D. S., Karam, E. P., & Hollenbeck, J. R. (2011). The impact of feedback frequency on learning and task performance: Challenging the “more is better” assumption. Organizational Behavior and Human Decision Processes, 116, 217–228.
Landis, R., Edwards, B., & Cortina, J. (2010). On the practice of allowing correlated residuals among indicators in structural equation models. In C. Lance & R. Vandenberg (Eds.), Statistical and methodological myths and urban legends (pp. 194–214). Routledge. https://doi.org/10.4324/9780203867266-16
Latham, G. P., & Arshoff, A. S. (2015). Planning. A mediator in goal-setting theory. In M. D. Mumford & M. Frese (Eds.), The psychology of planning in organizations: Research and applications (pp. 89–104). Routledge.
Lauermann, F., Tsai, Y.-M., & Eccles, J. S. (2017). Math-related career aspirations and choices within Eccles et al.’s expectancy-value theory of achievement-related behaviors. Developmental Psychology, 53(8), 1540–1559. https://doi.org/10.1037/dev0000367
Lavidas, K., Barkatsas, T., Manesis, D., & Gialamas, V. (2020). A structural equation model investigating the impact of tertiary students’ attitudes toward statistics, perceived competence at mathematics, and engagement on statistics performance. Statistics Education Research Journal, 19(2), 27–41.
Lazarides, R., & Raufelder, D. (Eds.). (2021). Motivation in unterrichtlichen fachbezogenen Lehr-Lernkontexten (Vol. 10). Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-31064-6
Lazarides, R., & Schiefele, U. (2021). Von der Lehrermotivation zur Schülermotivation: Ein integratives Modell zur motivationalen Entwicklung im Unterricht. In R. Lazarides & D. Raufelder (Eds.), Motivation in unterrichtlichen fachbezogenen Lehr-Lernkontexten (pp. 3–28). Springer Fachmedien Wiesbaden.
Lee, H., Chung, H. Q., Zhang, Y., Abedi, J., & Warschauer, M. (2020). The effectiveness and features of formative assessment in US K-12 education: a systematic review. Applied Measurement in Education, 33(2), 124–140. https://doi.org/10.1080/08957347.2020.1732383
Lee, Y [You-kyung], Freer, E., Robinson, K. A., Perez, T., Lira, A. K., Briedis, D., Walton, S. P., & Linnenbrink-Garcia, L. (2022). The multiplicative function of expectancy and value in predicting engineering students’ choice, persistence, and performance. Journal of Engineering Education, 111(3), 531–553. https://doi.org/10.1002/jee.20456
Lei, P.-W., & Shiverdecker, L. K. (2020). Performance of estimators for confirmatory factor analysis of ordinal variables with missing data. Structural Equation Modeling: A Multidisciplinary Journal, 27(4), 584–601. https://doi.org/10.1080/10705511.2019.1680292
Leitgöb, H. (2017). Ein Verfahren zur Dekomposition von Mode-Effekten in eine mess- und eine repräsentationsbezogene Komponente. In S. Eifler & F. Faulbaum (Eds.), Methodische Probleme von Mixed-Mode-Ansätzen in der Umfrageforschung (pp. 51–98). Springer VS.
Leutner, D. (2014). Motivation and emotion as mediators in multimedia learning. Learning and Instruction, 29, 174–175. https://doi.org/10.1016/j.learninstruc.2013.05.004
Li, C.-H. (2015). Confirmatory factor analysis with ordinal data: Comparing robust maximum likelihood and diagonally weighted least squares. Behavior Research Methods, 48(3), 936–949. https://doi.org/10.3758/s13428-015-0619-7
Li, S. (2010). The effectiveness of corrective feedback in SLA: a meta-analysis. Language Learning, 60(2), 309–365. https://doi.org/10.1111/j.1467-9922.2010.00561.x
Li, Y., Chang, M., Kravcik, M., Popescu, E., Huang, R., Kinshuk, & Chen, N.-S. (Eds.). (2016). Lecture Notes in Educational Technology. State-of-the-art and future directions of smart learning. Springer Singapore. https://doi.org/10.1007/978-981-287-868-7
Lichtenfeld, S., Pekrun, R., Stupnisky, R. H., Reiss, K., & Murayama, K. (2012). Measuring students’ emotions in the early years: The Achievement Emotions Questionnaire-Elementary School (AEQ-ES). Learning and Individual Differences, 22(2), 190–201. https://doi.org/10.1016/j.lindif.2011.04.009
Lipnevich, A. A., & Panadero, E. (2021). A review of feedback models and theories: descriptions, definitions, and conclusions. Frontiers in Education, 6. https://doi.org/10.3389/feduc.2021.720195
Lipnevich, A. A., & Smith, J. K. (Eds.). (2018). The Cambridge handbook of instructional feedback. Cambridge University Press. https://doi.org/10.1017/9781316832134
Lipnevich, A. A., Berg, D. A., & Smith, J. K. (2016). Toward a model of student response to feedback. In G. T. L. Brown & L. R. Harris (Eds.), Educational psychology handbook series. Handbook of human and social conditions in assessment (pp. 169–185). Routledge.
Lipnevich, A. A., Murano, D., Krannich, M., & Goetz, T. (2021). Should I grade or should I comment: Links among feedback, emotions, and performance. Learning and Individual Differences, 89. https://doi.org/10.1016/j.lindif.2021.102020
Lippe, P. von der, & Kladroba, A. (2008). Der unaufhaltsame Niedergang der Fächer Statistik und Ökonometrie in den Wirtschaftswissenschaften. AStA Wirtschafts- und Sozialstatistisches Archiv, 2(1), 21–40.
Little, R. (2005). Dropouts in longitudinal studies: methods of analysis. Encyclopedia of Statistics in Behavioral Science. Wiley StatsRef. https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat06596
Little, R. J., & Raghunathan, T. (1999). On summary measures analysis of the linear mixed effects model for repeated measures when data are not missing completely at random. Statistics in Medicine, 18(17–18), 2465–2478. https://doi.org/10.1002/(sici)1097-0258(19990915/30)18:17/18%3C2465::aid-sim269%3E3.0.co;2-2
Little, T. D., & Card, N. A. (2013). Longitudinal structural equation modeling. Guilford Publications.
Liu, W. C., Wang, J. C. K., & Ryan, R. M. (Eds.). (2016). Building autonomous learners. Springer Singapore. https://doi.org/10.1007/978-981-287-630-0
Liu, Y., & Sriutaisuk, S. (2019). Evaluation of model fit in structural equation models with ordinal missing data: An examination of the D2 method. Structural Equation Modeling: A Multidisciplinary Journal, 27(4), 561–583. https://doi.org/10.1080/10705511.2019.1662307
Lo, C. K., Hew, K. F., & Chen, G. (2017). Toward a set of design principles for mathematics flipped classrooms. Educational Research Review, 22, 50–73.
Lo, C. K., & Hew, K. F. (2017). A critical review of flipped classroom challenges in K-12 education: Possible solutions and recommendations for future research. Research and Practice in Technology Enhanced Learning, 12(1), 4. https://doi.org/10.1186/s41039-016-0044-2
Lo, C. K., & Hew, K. F. (2021). Student engagement in mathematics flipped classrooms: Implications of journal publications from 2011 to 2020. Frontiers in Psychology, 12. https://doi.org/10.3389/fpsyg.2021.672610
Locke, E. A., & Latham, G. P. (2002). Building a practically useful theory of goal setting and task motivation. A 35-year odyssey. The American Psychologist, 57(9), 705–717. https://doi.org/10.1037//0003-066X.57.9.705
Loderer, K., Pekrun, R., & Lester, J. C. (2020). Beyond cold technology: A systematic review and meta-analysis on emotions in technology-based learning environments. Learning and Instruction, 70. https://doi.org/10.1016/j.learninstruc.2018.08.002
Loeffler, S., Stumpp, J., Grund, S., Limberger, M., & Ebner-Priemer, U. (2019). Fostering self-regulation to overcome academic procrastination using interactive ambulatory assessment. Learning and Individual Differences, 75. https://doi.org/10.1016/j.lindif.2019.101760
Lohbeck, A., Hagenauer, G., & Frenzel, A. C. (2018). Teachers’ self-concepts and emotions: Conceptualization and relations. Teaching and Teacher Education, 70, 111–120. https://doi.org/10.1016/j.tate.2017.11.001
Love, B., Hodge, A., Grandgenett, N., & Swift, A. W. (2014). Student learning and perceptions in a flipped linear algebra course. International Journal of Mathematical Education in Science and Technology, 45(3), 317–324. https://doi.org/10.1080/0020739X.2013.822582
Lovett, M. (2001). A collaborative convergence on studying reasoning processes: A case study in statistics. In S. M. Carver & D. Klahr (Eds.), Cognition and instruction: Twenty-five years of progress (pp. 347–384). Erlbaum.
Lüdders, L., & Zeeb, H. (2020). Methoden der empirischen Forschung: Ein Handbuch für Studium und Berufspraxis (Methodenbücher) (1st ed.). Apollon University Press.
Maad, M. (2012). Interaction effect of task demands and goal orientations on language learners’ perceptions of task difficulty and motivation. The Journal of Language Teaching and Learning, 2012(1), 1–14.


Macher, D., Paechter, M., Papousek, I., & Ruggeri, K. (2012). Statistics anxiety, trait anxiety, learning behavior, and academic performance. European Journal of Psychology of Education, 27(4), 483–498.
Macher, D., Paechter, M., Papousek, I., Ruggeri, K., Freudenthaler, H., & Arendasy, M. (2013). Statistics anxiety, state anxiety during an examination, and academic achievement. British Journal of Educational Psychology, 83, 535–549. https://doi.org/10.3389/fpsyg.2015.01116
MacKenzie, S. B., Podsakoff, P. M., & Jarvis, C. B. (2005). The problem of measurement model misspecification in behavioral and organizational research and some recommended solutions. Journal of Applied Psychology, 90(4), 710–730. https://doi.org/10.1037/0021-9010.90.4.710
Magner, U. I., Schwonke, R., Aleven, V., Popescu, O., & Renkl, A. (2014). Triggering situational interest by decorative illustrations both fosters and hinders learning in computer-based learning environments. Learning and Instruction, 29, 141–152. https://doi.org/10.1016/j.learninstruc.2012.07.002
Mai, Y., Zhang, Z., & Wen, Z. (2018). Comparing exploratory structural equation modeling and existing approaches for multiple regression with latent variables. Structural Equation Modeling: A Multidisciplinary Journal, 25(5), 737–749. https://doi.org/10.1080/10705511.2018.1444993
Malespina, A., & Singh, C. (2022). Gender differences in test anxiety and self-efficacy: why instructors should emphasize low-stakes formative assessments in physics courses. European Journal of Physics, 43(3), 35701. https://doi.org/10.1088/1361-6404/ac51b1
Marchand, G. C., & Gutierrez, A. P. (2017). Processes involving perceived instructional support, task value, and engagement in graduate education. The Journal of Experimental Education, 85(1), 87–106. https://doi.org/10.1080/00220973.2015.1107522
Marcoulides, G. A. (Ed.). (1998). Quantitative methodology series. Modern methods for business research. Lawrence Erlbaum.
Marden, N. Y., Ulman, L. G., Wilson, F. S., & Velan, G. M. (2013). Online feedback assessments in physiology: Effects on students’ learning experiences and outcomes. Advances in Physiology Education, 37(2), 192–200. https://doi.org/10.1152/advan.00092.2012
Marsh, H. (2007). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases and usefulness. In R. P. Perry & J. C. Smart (Eds.), The Scholarship of Teaching and Learning in Higher Education: An Evidence-Based Perspective (pp. 319–383). Springer.
Marsh, H. W., Lüdtke, O., Nagengast, B., Morin, A. J., & von Davier, M. (2013). Why item parcels are (almost) never appropriate: Two wrongs do not make a right – Camouflaging misspecification with item parcels in CFA models. Psychological Methods, 18(3), 257–284. https://doi.org/10.1037/a0032773
Marsh, H. W., Pekrun, R., Parker, P. D., Murayama, K., Guo, J., Dicke, T., & Arens, A. K. (2019). The murky distinction between self-concept and self-efficacy: Beware of lurking jingle-jangle fallacies. Journal of Educational Psychology, 111(2), 331–353. https://doi.org/10.1037/edu0000281
Marsh, H. W., Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2005). Academic self-concept, interest, grades, and standardized test scores: Reciprocal effects models of causal ordering. Child Development, 76(2), 397–416. https://doi.org/10.1111/j.1467-8624.2005.00853.x


Marsh, H. W., Wen, Z., & Hau, K.-T. (2004). Structural equation models of latent interactions: Evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9(3), 275–300. https://doi.org/10.1037/1082-989X.9.3.275
Marshman, E. M., Kalender, Z. Y., Nokes-Malach, T., Schunn, C., & Singh, C. (2018). Female students with A’s have similar physics self-efficacy as male students with C’s in introductory courses: A cause for alarm? Physical Review Physics Education Research, 14(2). https://doi.org/10.1103/PhysRevPhysEducRes.14.020123
Martin, N., Hughes, J., & Fugelsang, J. (2017). The roles of experience, gender, and individual differences in statistical reasoning. Statistics Education Research Journal, 16(2), 454–475. https://doi.org/10.52041/serj.v16i2.201
Marzano, R. J., Pickering, D. J., & Pollock, J. E. (2001). Classroom instruction that works. Research-based strategies for increasing student achievement. Alexandria, US: Association for Supervision and Curriculum Development.
Maslowsky, J., Jager, J., & Hemken, D. (2015). Estimating and interpreting latent variable interactions: A tutorial for applying the latent moderated structural equations method. International Journal of Behavioral Development, 39(1), 87–96. https://doi.org/10.1177/0165025414552301
Mason, B. J., & Bruning, R. H. (2001). Providing feedback in computer-based instruction: What the research tells us (CLASS Research Report No. 9). Center for Instructional Innovation.
Mason, G. S., Shuman, T. R., & Cook, K. E. (2013). Comparing the effectiveness of an inverted classroom to a traditional classroom in an upper-division engineering course. IEEE Transactions on Education, 56(4), 430–435. https://doi.org/10.1109/TE.2013.2249066
Maydeu-Olivares, A. (2017). Maximum likelihood estimation of structural equation models for continuous data: standard errors and goodness of fit. Structural Equation Modeling: A Multidisciplinary Journal, 24(3), 383–394. https://doi.org/10.1080/10705511.2016.1269606
Mayer, R. E. (1999). Multimedia aids to problem-solving transfer – A dual coding approach. International Journal of Educational Research, 31, 611–623.
Mayo, M., Kakarika, M., Pastor, J. C., & Brutus, S. (2012). Aligning or inflating your leadership self-image? A longitudinal study of responses to peer feedback in MBA teams. Academy of Management Learning & Education, 11(4), 631–652.
McIntyre, S. H., & Munson, J. M. (2008). Exploring cramming: Student behaviors, beliefs, and learning retention in the principles of marketing course. Journal of Marketing Education, 30(3), 226–243. https://doi.org/10.1177/0273475308321819
McKenzie, W., Perini, E., Rohlf, V., Toukhsati, S., Conduit, R., & Sanson, G. (2013). A blended learning lecture delivery model for large and diverse undergraduate cohorts. Computers & Education, 64, 116–126. https://doi.org/10.1016/j.compedu.2013.01.009
McNulty, J. A., Espiritu, B. R., Hoyt, A. E., Ensminger, D. C., & Chandrasekhar, A. J. (2014). Associations between formative practice quizzes and summative examination outcomes in a medical anatomy course. Anatomical Sciences Education, 8(1), 37–44. https://doi.org/10.1002/ase.1442


Meade, A. W., & Kroustalis, C. M. (2006). Problems with item parceling for confirmatory factor analytic tests of measurement invariance. Organizational Research Methods, 9(3), 369–403. https://doi.org/10.1177/1094428105283384
Means, B., Bakia, M., & Murphy, R. (2014). Learning Online: What Research Tells Us About Whether, When, and How. Routledge.
Melad, A. (2022). Students’ attitude and academic achievement in statistics: a correlational study. Journal of Positive School Psychology, 6(2), 4640–4646.
Merkt, M., Weigand, S., Heier, A., & Schwan, S. (2011). Learning with videos vs. learning with print. The role of interactive features. Learning and Instruction, 21, 687–704. https://doi.org/10.1016/j.learninstruc.2011.03.004
Mesly, O. (2015). Creating models in psychological research. Springer International Publishing. https://doi.org/10.1007/978-3-319-15753-5
Mevarech, Z. R. (1983). A Deep Structure Model of Students’ Statistical Misconceptions. Educational Studies in Mathematics, 14(4), 415–429.
Meyer, J., Fleckenstein, J., & Köller, O. (2019). Expectancy value interactions and academic achievement: Differential relationships with achievement measures. Contemporary Educational Psychology, 58, 58–74. https://doi.org/10.1016/j.cedpsych.2019.01.006
Milic, N. M., Masic, S., Milin-Lazovic, J., Trajkovic, G., Bukumiric, Z., Savic, M., Milic, N. V., Cirkovic, A., Gajic, M., Kostic, M., Ilic, A., & Stanisavljevic, D. (2016). The importance of medical students’ attitudes regarding cognitive competence for teaching applied statistics: Multi-site study and meta-analysis. PloS One, 11(10), e0164439. https://doi.org/10.1371/journal.pone.0164439
Mirriahi, N., & Dawson, S. (2013). The pairing of lecture recording data with assessment scores: A method of discovering pedagogical impact. In D. Suthers, K. Verbert, E. Duval, & X. Ochoa (Eds.), Proceedings of the Third International Conference on Learning Analytics and Knowledge (pp. 180–184). New York: ACM.
Moore, C., & Chung, C.-J. (2015). Students’ attitudes, perceptions, and engagement within a flipped classroom model as related to learning mathematics. Journal of Studies in Education, 5(3), 286–308.
Moosbrugger, H., & Kelava, A. (2020). Qualitätsanforderungen an Tests und Fragebogen. In H. Moosbrugger & A. Kelava (Eds.), Lehrbuch. Testtheorie und Fragebogenkonstruktion (3rd ed., pp. 13–38). Springer.
Moosbrugger, H., & Kelava, A. (Eds.). (2020). Lehrbuch. Testtheorie und Fragebogenkonstruktion (3rd ed.). Springer.
Moozeh, K., Farmer, J., Tihanyi, D., Nadar, T., & Evans, G. J. (2019). A prelaboratory framework toward integrating theory and utility value with laboratories: student perceptions on learning and motivation. Journal of Chemical Education, 96(8), 1548–1557. https://doi.org/10.1021/acs.jchemed.9b00107
Moradi, S., Maraghi, E., Babaahmadi, A., & Younespour, S. (2021). Application of pop quiz method in teaching biostatistics to postgraduate midwifery students and its effect on their statistics anxiety, test anxiety and academic achievement: A quasi-experimental study with control group. Journal of Biostatistics and Epidemiology. https://doi.org/10.18502/jbe.v7i2.6736
Moreno, R. (2004). Decreasing cognitive load for novice students: Effects of explanatory versus corrective feedback in discovery-based multimedia. Instructional Science, 32(1), 99–113.


Moreno, R. (2006). Does the modality principle hold for different media? A test of the method-affects-learning hypothesis. Journal of Computer Assisted Learning, 22(3), 149–158. https://doi.org/10.1111/j.1365-2729.2006.00170.x
Morris, R., Perry, T., & Wardle, L. (2021). Formative assessment and feedback for learning in higher education: A systematic review. Review of Education, 9(3). https://doi.org/10.1002/rev3.3292
Muir, T., Milthorpe, N., Stone, C., Dyment, J., Freeman, E., & Hopwood, B. (2019). Chronicling engagement: students’ experience of online learning over time. Distance Education, 40(2), 262–277. https://doi.org/10.1080/01587919.2019.1600367
Muis, K. R., Ranellucci, J., Franco, G. M., & Crippen, K. J. (2013). The interactive effects of personal achievement goals and performance feedback in an undergraduate science class. The Journal of Experimental Education, 81(4), 556–578. https://doi.org/10.1080/00220973.2012.738257
Murillo-Zamorano, L. R., López Sánchez, J. Á., & Godoy-Caballero, A. L. (2019). How the flipped classroom affects knowledge, skills, and engagement in higher education: Effects on students’ satisfaction. Computers & Education, 141, 103608. https://doi.org/10.1016/j.compedu.2019.103608
Murphy, C., & Stewart, J. (2015). The impact of online or f2f lecture choice on student achievement and engagement in a large lecture-based science course: closing the gap. Online Learning, 19(3), 91–110. https://doi.org/10.24059/olj.v19i3.670
Murphy, K. R., & Russell, C. J. (2017). Mend it or end it. Organizational Research Methods, 20(4), 549–573. https://doi.org/10.1177/1094428115625322
Murray, A. L., Obsuth, I., Eisner, M., & Ribeaud, D. (2017). Evaluating longitudinal invariance in dimensions of mental health across adolescence: An analysis of the social behavior questionnaire. Assessment, 26(7), 1234–1245. https://doi.org/10.1177/1073191117721741
Muthén, B. [bmuthen]. (2002, June 27). Some latent variables are correlated [Online Forum Post]. http://www.statmodel.com/discussion/messages/11/186.html?1390515998
Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with nonignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16(1), 17–33. https://doi.org/10.1037/a0022634
Muthén, L., & Muthén, B. (2011). Chi-Square Difference Testing Using the Satorra-Bentler Scaled Chi-Square. http://www.statmodel.com/chidiff.shtml
Muthén, L., & Muthén, B. (n.d.). Latent variable interaction loop plot. https://www.statmodel.com/download/Latent variable interaction LOOP plot.pdf
Nadolny, L., & Halabi, A. (2015). Student participation and achievement in a large lecture course with game-based learning. Simulation & Gaming, 47(1), 51–72. https://doi.org/10.1177/1046878115620388
Nagengast, B., Marsh, H. W., Scalas, L. F., Xu, M. K., Hau, K.-T., & Trautwein, U. (2011). Who took the “x” out of expectancy-value theory? A psychological mystery, a substantive-methodological synergy, and a cross-national generalization. Psychological Science, 22(8), 1058–1066. https://doi.org/10.1177/0956797611415540


Namaziandost, E., & Çakmak, F. (2020). An account of EFL learners’ self-efficacy and gender in the Flipped Classroom Model. Education and Information Technologies, 25(5), 4041–4055. https://doi.org/10.1007/s10639-020-10167-7
Narciss, S., & Huth, K. (2004). How to design informative tutoring feedback for multimedia learning. In H. M. Niegemann, D. Leutner, & R. Brünken (Eds.), Instructional design for multimedia learning (pp. 181–195). Waxmann.
Narciss, S. (2008). Feedback strategies for interactive learning tasks. In J. M. Spector (Ed.), Handbook of research on educational communications and technology (3rd ed., pp. 125–144). Erlbaum.
Nasser, F. M. (2004). Structural model of the effects of cognitive and affective factors on the achievement of Arabic-speaking pre-service teachers in introductory statistics. Journal of Statistics Education, 12(1). https://doi.org/10.1080/10691898.2004.11910717
Neroni, J., Meijs, C., Gijselaers, H. J. M., Kirschner, P. A., & de Groot, R. H. M. (2019). Learning strategies and academic performance in distance education. Learning and Individual Differences, 74, 1–7. https://doi.org/10.1016/j.lindif.2019.04.007
Newman, D. A. (2003). Longitudinal modeling with randomly and systematically missing data: a simulation of ad hoc, maximum likelihood, and multiple imputation techniques. Organizational Research Methods, 6(3), 328–362. https://doi.org/10.1177/1094428103254673
Nichols, S. L., & Dawson, H. S. (2012). Assessment as a context for student engagement. In S. L. Christenson, A. L. Reschly, & C. Wylie (Eds.), Handbook of research on student engagement (pp. 457–477). Springer US. https://doi.org/10.1007/978-1-4614-2018-7_22
Niculescu, A. C., Tempelaar, D. T., Dailey-Hebert, A., Segers, M., & Gijselaers, W. H. (2016). Extending the change–change model of achievement emotions: The inclusion of negative learning emotions. Learning and Individual Differences, 47, 289–297. https://doi.org/10.1016/j.lindif.2015.12.015
Nielsen, P. L., Bean, N. W., & Larsen, R. A. A. (2018). The impact of a flipped classroom model of learning on a large undergraduate statistics class. Statistics Education Research Journal, 17(1), 121–140. https://doi.org/10.52041/serj.v17i1.179
Nolan, M. M., Beran, T., & Hecker, K. G. (2012). Surveys assessing students’ attitudes toward statistics: A systematic review of validity and reliability. Statistics Education Research Journal, 11(2), 103–123. https://doi.org/10.52041/serj.v11i2.333
Núñez-Peña, M. I., Bono, R., & Suárez-Pellicioni, M. (2015). Feedback on students’ performance: A possible way of reducing the negative effect of math anxiety in higher education. International Journal of Educational Research, 70, 80–87. https://doi.org/10.1016/j.ijer.2015.02.005
Nunnally, J. C., & Bernstein, I. H. (1993). Psychometric theory. McGraw-Hill Professional.
Nuthall, G. A. (2005). The cultural myths and realities of classroom teaching and learning: A personal journey. Teachers College Record, 107(5), 895–934. https://doi.org/10.1111/j.1467-9620.2005.00498.x
Ocker, R., & Yaverbaum, G. (1999). Asynchronous computer-mediated communication versus face-to-face collaboration: results on student learning, quality and satisfaction. Group Decision and Negotiation, 8, 427–440.
OECD (2015). The ABC of gender equality in education: Aptitude, behaviour, confidence, PISA. OECD Publishing.


OECD. (2014). Was Schülerinnen und Schüler wissen und können: Schülerleistungen in Lesekompetenz, Mathematik und Naturwissenschaften; [programme for international student assessment’s (überarb. Ausg). PISA: Bd. 1. OECD; Bertelsmann. https://doi.org/10.1787/9789264208858-de
O’Flaherty, J., & Phillips, C. (2015). The use of flipped classrooms in higher education: A scoping review. The Internet and Higher Education, 25, 85–95. https://doi.org/10.1016/j.iheduc.2015.02.002
Onwuegbuzie, A. J. (2004). Academic procrastination and statistics anxiety. Assessment and Evaluation in Higher Education, 29, 3–19. https://doi.org/10.1080/0260293042000160384
Onwuegbuzie, A. J. (2003). Modeling statistics achievement among graduate students. Educational and Psychological Measurement, 63(6), 1020–1038.
Onwuegbuzie, A. J., & Wilson, V. A. (2003). Statistics Anxiety: Nature, etiology, antecedents, effects, and treatments--a comprehensive review of the literature. Teaching in Higher Education, 8(2), 195–209. https://doi.org/10.1080/1356251032000052447
Opstad, L. (2020). Attitudes towards statistics among business students: do gender, mathematical skills and personal traits matter? Sustainability, 12(15), 6104. https://doi.org/10.3390/su12156104
Ozkok, O., Zyphur, M. J., Barsky, A. P., Theilacker, M., Donnellan, M. B., & Oswald, F. L. (2019). Modeling measurement as a sequential process: Autoregressive confirmatory factor analysis (AR-CFA). Frontiers in Psychology, 1–19. https://doi.org/10.3389/fpsyg.2019.02108
Paas, F., Renkl, A., & Sweller, J. (2003). Cognitive load theory and instructional design: Recent developments. Educational Psychologist, 38(1), 1–4.
Pablo, N., & Chance, B. (2018). Can a simulation-based inference course be flipped? In M. A. Sorto, A. White, & L. Guyot (Eds.), Looking back, looking forward. Proceedings of the Tenth International Conference on Teaching Statistics. International Statistics Institute.
Pahljina-Reinić, R., & Kolić-Vehovec, S. (2017). Average personal goal pursuit profile and contextual achievement goals: Effects on students’ motivation, achievement emotions, and achievement. Learning and Individual Differences, 56, 167–174. https://doi.org/10.1016/j.lindif.2017.01.020
Pajares, F. (2002). Gender and Perceived Self-Efficacy in Self-Regulated Learning. Theory Into Practice, 41(2), 116–125. https://doi.org/10.1207/s15430421tip4102_8
Panadero, E. (2017). A review of self-regulated learning: Six models and four directions for research. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.00422
Panadero, E., & Lipnevich, A. A. (2022). A review of feedback models and typologies: Towards an integrative model of feedback elements. Educational Research Review, 35, 100416. https://doi.org/10.1016/j.edurev.2021.100416
Parr, A., Amemiya, J., & Wang, M.-T. (2019). Student learning emotions in middle school mathematics classrooms: investigating associations with dialogic instructional practices. Educational Psychology, 39(5), 636–658. https://doi.org/10.1080/0144341.2018.1560395
Patzelt, J., & Opitz, I. (2014). Deutsche Version der Aitken Procrastination Scale (APS-d) [= German version of the APS]. https://zis.gesis.org/skala/Patzelt-Opitz-Deutsche-Version-der-Aitken-Procrastination-Scale-(APS-d) Accessed 05 April 202.


Paul, W. (2017). An exploration of student attitudes and satisfaction in a GAISE-influenced introductory statistics course. Statistics Education Research Journal, 16(2), 487–51. https://doi.org/10.52041/serj.v16i2.203
Peixoto, F., Mata, L., Monteiro, V., Sanches, C., & Pekrun, R. (2015). The Achievement Emotions Questionnaire: Validation for pre-adolescent students. European Journal of Developmental Psychology, 12(4), 472–481.
Peixoto, F., Sanches, C., Mata, L., & Monteiro, V. (2017). “How do you feel about math?”: relationships between competence and value appraisals, achievement emotions and academic achievement. European Journal of Psychology of Education, 32(3), 385–405. https://doi.org/10.1007/s10212-016-0299-4
Pekrun, R. (2006). The control-value theory of achievement emotions: Assumptions, corollaries, and implications for educational research and practice. Educational Psychology Review, 18(4), 315–341.
Pekrun, R., Goetz, T., Frenzel, A. C., Barchfeld, P., & Perry, R. P. (2010). Measuring emotions in students’ learning and performance: The Achievement Emotions Questionnaire (AEQ). Contemporary Educational Psychology, 36, 36–48.
Pekrun, R., Goetz, T., Perry, R. P., Kramer, K., Hochstadt, M., & Molfenter, S. (2007). Beyond test anxiety: Development and validation of the Test Emotions Questionnaire (TEQ). Anxiety, Stress & Coping: An International Journal, 17(3), 287–316.
Pekrun, R., Goetz, T., Titz, W., & Perry, R. P. (2002). Academic emotions in students’ self-regulated learning and achievement. Educational Psychologist, 37(2), 91–106.
Pekrun, R., Lichtenfeld, S., Marsh, H. W., Murayama, K., & Goetz, T. (2017). Achievement emotions and academic performance: Longitudinal models of reciprocal effects. Child Development, 88(5), 1653–1670.
Pekrun, R. (2006). The control-value theory of achievement emotions: assumptions, corollaries, and implications for educational research and practice. Educational Psychology Review, 18(4), 315–341. https://doi.org/10.1007/s10648-006-9029-9
Pekrun, R. (2007). Emotions in students’ scholastic development. In R. P. Perry & J. C. Smart (Eds.), The scholarship of teaching and learning in higher education: An evidence-based perspective (1st ed., pp. 553–610). Springer.
Pekrun, R. (2017). Emotion and achievement during adolescence. Child Development Perspectives, 11(3), 215–221. https://doi.org/10.1111/cdep.12237
Pekrun, R. (2018). Emotion, Lernen und Leistung. In M. Huber & S. Krause (Eds.), Bildung und Emotion. Springer VS, 2018.
Pekrun, R., & Linnenbrink-Garcia, L. (2012). Academic emotions and student engagement. In S. L. Christenson, A. L. Reschly, & C. Wylie (Eds.), Handbook of research on student engagement (pp. 259–282). Springer US. https://doi.org/10.1007/978-1-4614-2018-7_12
Pekrun, R., & Stephens, E. J. (2010). Achievement emotions in higher education. In J. C. Smart (Ed.), Higher Education: Handbook of Theory and Research. Higher Education: Handbook of Theory and Research (Vol. 25, pp. 257–306). Springer Netherlands. https://doi.org/10.1007/978-90-481-8598-6_7
Pekrun, R., Cusack, A., Murayama, K., Elliot, A. J., & Thomas, K. (2014). The power of anticipated feedback: Effects on students’ achievement goals and achievement emotions. Learning and Instruction, 29, 115–124. https://doi.org/10.1016/j.learninstruc.2013.09.002


Pekrun, R., Goetz, T., Frenzel, A. C., Barchfeld, P., & Perry, R. P. (2011). Measuring emotions in students’ learning and performance: The Achievement Emotions Questionnaire (AEQ). Contemporary Educational Psychology, 36(1), 36–48. https://doi.org/10.1016/j.cedpsych.201.10.002
Pekrun, R., Goetz, T., Perry, R. P., Kramer, K., Hochstadt, M., & Molfenter, S. (2004). Beyond test anxiety: Development and validation of the test emotions questionnaire (TEQ). Anxiety, Stress & Coping, 17(3), 287–316. https://doi.org/10.1080/10615800412331303847
Pekrun, R., Goetz, T., Titz, W., & Perry, R. P. (2002). Academic emotions in students’ self-regulated learning and achievement: a program of qualitative and quantitative research. Educational Psychologist, 37(2), 91–105. https://doi.org/10.1207/S15326985EP3702_4
Pekrun, R., Lichtenfeld, S., Marsh, H. W., Murayama, K., & Goetz, T. (2017). Achievement emotions and academic performance: Longitudinal models of reciprocal effects. Child Development, 88(5), 1653–1670. https://doi.org/10.1111/cdev.12704
Péladeau, N., Forget, J., & Gagné, F. (2003). Effect of paced and unpaced practice on skill application and retention: How much is enough? American Educational Research Journal, 40(3), 769–801. https://doi.org/10.3102/00028312040003769
Pellegrino, J. W. (2010). The design of an assessment system for the race to the top: A learning sciences perspective on issues of growth and measurement. https://www.ets.org/Media/Research/pdf/PellegrinoPresenterSession1.pdf
Pentz, M. A., & Chou, C.-P. (1994). Measurement invariance in longitudinal clinical research assuming change from development and intervention. Journal of Consulting and Clinical Psychology, 62(3), 450–462. https://doi.org/10.1037/0022-006x.62.3.450
Perepiczka, M., Chandler, N., & Becerra, M. (2011). Relationship graduate students’ statistics self-efficacy, statistics anxiety, attitude toward statistics, and social support. The Professional Counselor, 1(2), 99–108. https://doi.org/10.15241/mpa.1.2.99
Perez, T., Dai, T., Kaplan, A., Cromley, J. G., Brooks, W. D., White, A. C., Mara, K. R., & Balsai, M. J. (2019). Interrelations among expectancies, task values, and perceived costs in undergraduate biology achievement. Learning and Individual Differences, 72, 26–38. https://doi.org/10.1016/j.lindif.2019.04.001
Perry, R. P., & Smart, J. C. (Eds.). (2007). The scholarship of teaching and learning in higher education: An evidence-based perspective (1st ed.). Springer.
Persson, I., Kraus, K., Hansson, L., & Wallentin, F. Y. (2019). Confirming the structure of the survey of attitudes toward statistics (SATS-36) by Swedish students. Statistics Education Research Journal, 18(1), 83–93. https://doi.org/10.52041/serj.v18i1.151
Peterson, E. R., Brown, G. T., & Jun, M. C. (2015). Achievement emotions in higher education: A diary study exploring emotions across an assessment event. Contemporary Educational Psychology, 42, 82–96. https://doi.org/10.1016/j.cedpsych.2015.05.002
Peterson, M. L. (1975). Educational programs for team delivery. Interdisciplinary education of health associates: The Johns Hopkins experience. Journal of Medical Education, 50(12 pt 2), 111–117. https://doi.org/10.1097/00001888-197512000-00015
Pfennig, A. (2020). Matching course assessment of a first year material science course to the blended-learning teaching approach. International Journal of e-Education, e-Business, e-Management and e-Learning, 10(1), 53–59. https://doi.org/10.17706/ijeeee.202.1.1.53-59


Pintrich, P. R., Smith, D. A. F., Garcia, T., & McKeachie, W. J. (1991). A Manual for the Use of the Motivated Strategies for Learning Questionnaire (MSLQ). University of Michigan, National Center for Research to Improve Postsecondary Teaching and Learning.
Pintrich, P. R. (2004). A conceptual framework for assessing motivation and self-regulated learning in college students. Educational Psychology Review, 16(4), 385–407. https://doi.org/10.1007/s10648-004-0006-x
Pintrich, P. R., & de Groot, E. V. (1990). Motivational and self-regulated learning components of classroom academic performance. Journal of Educational Psychology, 82(1), 33–4. https://doi.org/10.1037/0022-0663.82.1.33
Podsakoff, P. M., MacKenzie, S. B., Lee, J.-Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. The Journal of Applied Psychology, 88(5), 879–903. https://doi.org/10.1037/0021901.88.5.879
Poljicanin, A., Caric, A., Vilovic, K., Kosta, V., Guic, M. M., Aljinovic, J., & Grkovic, I. (2009). Daily mini quizzes as means for improving student performance in anatomy course. Medical Education, 50, 55–6.
Pritchard, R. D., Young, B. Y., Koenig, N., Schmerling, D., & Wright Dixon, N. (2013). Long-term effects of goal setting on performance with the productivity measurement and enhancement system (ProMES). In E. A. Locke & G. P. Latham (Eds.), New Developments in goal setting and task performance (pp. 233–245). Routledge.
Pritikin, J. N., Brick, T. R., & Neale, M. C. (2018). Multivariate normal maximum likelihood with both ordinal and continuous variables, and data missing at random. Behavior Research Methods, 50(2), 490–500. https://doi.org/10.3758/s13428-017-1011-6
Putwain, D. W., Larkin, D., & Sander, P. (2013). A reciprocal model of achievement goals and learning related emotions in the first year of undergraduate study. Contemporary Educational Psychology, 38(4), 361–374. https://doi.org/10.1016/j.cedpsych.2013.07.003
Putwain, D. W., Sander, P., & Larkin, D. (2013). Using the 2×2 framework of achievement goals to predict achievement emotions and academic performance. Learning and Individual Differences, 25, 80–84. https://doi.org/10.1016/j.lindif.2013.01.006
Quilici, J. L., & Mayer, R. E. (2002). Teaching students to recognize structural similarities between statistics word problems. Applied Cognitive Psychology, 16(3), 325–342. https://doi.org/10.1002/acp.796
Raman, M., Mclaughlin, K., Violato, C., Rostom, A., Allard, J., & Coderre, S. (2010). Teaching in small portions dispersed over time enhances long-term knowledge retention. Medical Teacher, 32(3), 250–255. https://doi.org/10.3109/01421590903197019
Ramirez, C., Schau, C., & Emmioglu, E. (2012). The importance of attitudes in statistics education. Statistics Education Research Journal, 11(2), 57–71.
Randhawa, B. S., Beamer, J. E., & Lundberg, I. (1993). Role of mathematics self-efficacy in the structural model of mathematics achievement. Journal of Educational Psychology, 85(1), 41–48. https://doi.org/10.1037/0022-0663.85.1.41
Ranellucci, J., Robinson, K. A., Rosenberg, J. M., Lee, Y [You-kyung], Roseth, C. J., & Linnenbrink-Garcia, L. (2021). Comparing the roles and correlates of emotions in class and during online video lectures in a flipped anatomy classroom. Contemporary Educational Psychology, 65. https://doi.org/10.1016/j.cedpsych.2021.101966


Rausch, A., Kögler, K., & Seifried, J. (2019). Validation of Embedded Experience Sampling (EES) for measuring non-cognitive facets of problem-solving competence in scenario-based assessments. Frontiers in Psychology, 10, 120. https://doi.org/10.3389/fpsyg.2019.01200
Razzaq, R., Ostrow, K. S., & Heffernan, N. T. (2020). Effect of immediate feedback on math achievement at the high school level. In Lecture notes in computer science (pp. 263–267). Springer International Publishing. https://doi.org/10.1007/978-3-030-52240-7_48
Reading, C. (2002). Profile for statistical understanding [Paper presentation]. ICOTS6, Cape Town, South Africa.
Reeve, J. (2012). A self-determination theory perspective on student engagement. In S. L. Christenson, A. L. Reschly, & C. Wylie (Eds.), Handbook of research on student engagement (pp. 149–172). Springer US. https://doi.org/10.1007/978-1-4614-2018-7_7
Resnik, P., & Dewaele, J.-M. (2021). Learner emotions, autonomy and trait emotional intelligence in ‘in-person’ versus emergency remote English foreign language teaching in Europe. Applied Linguistics Review. https://doi.org/10.1515/applirev-2020-0096
Respondek, L., Seufert, T., & Nett, U. E. (2019). Adding previous experiences to the person-situation debate of achievement emotions. Contemporary Educational Psychology, 58, 19–32.
Richard, E. M., Diefendorff, J. M., & Martin, J. H. (2006). Revisiting the within-person self-efficacy and performance relation. Human Performance, 19(1), 67–87. https://doi.org/10.1207/s15327043hup1901_4
Richter, D., Lehrl, S., & Weinert, S. (2015). Enjoyment of learning and learning effort in primary school: The significance of child individual characteristics and stimulation at home and at preschool. Early Child Development and Care, 186(1), 96–116. https://doi.org/10.1080/03004430.2015.1013950
Riegel, K., & Evans, T. (2021). Student achievement emotions: Examining the role of frequent online assessment. Australasian Journal of Educational Technology, 75–87. https://doi.org/10.14742/ajet.6516
Ringle, C. M., & Spreen, F. (2007). Beurteilung der Ergebnisse von PLS-Pfadanalysen. Das Wirtschaftsstudium, 36(2), 211–216.
Rios, J., & Wells, C. (2014). Validity evidence based on internal structure. Psicothema, 26(1), 108–116. https://doi.org/10.7334/psicothema2013.260
Roberts, D. M., & Bilderback, E. W. (1980). Reliability and validity of a statistics attitude survey. Educational and Psychological Measurement, 40(1), 235–238. https://doi.org/10.1177/001316448004000138
Rodarte-Luna, B., & Sherry, A. (2008). Sex differences in the relation between statistics anxiety and cognitive/learning strategies. Contemporary Educational Psychology, 33(2), 327–344. https://doi.org/10.1016/j.cedpsych.2007.03.002
Römmer-Nossek, B., Peschl, M. F., & Zimmermann, E. (2013). Kognitionswissenschaft. Ihre Perspektive auf Lernen und Lehren mit Technologien. In M. Ebner & S. Schön (Eds.), Lehrbuch für Lernen und Lehren mit Technologien (2nd ed., pp. 374–386). epubli.
Rosenzweig, E. Q., & Wigfield, A. (2016). STEM motivation interventions for adolescents: a promising start, but further to go. Educational Psychologist, 51(2), 146–163. https://doi.org/10.1080/0046152.2016.1154792
Rosenzweig, E. Q., Wigfield, A., & Hulleman, C. S. (2020). More useful or not so bad? Examining the effects of utility value and cost reduction interventions in college physics.


Journal of Educational Psychology, 112(1), 166–182. https://doi.org/10.1037/edu0000370
Ross, B., Chase, A.-M., Robbie, D., Oates, G., & Absalom, Y. (2018). Adaptive quizzes to increase motivation, engagement and learning outcomes in a first year accounting unit. International Journal of Educational Technology in Higher Education, 15(1). https://doi.org/10.1186/s41239-018-0113-2
Rotenstein, A., Davis, H. Z., & Tatum, L. (2009). Early birds versus just-in-timers: the effect of procrastination on academic performance of accounting students. Journal of Accounting Education, 27(4), 223–232. https://doi.org/10.1016/j.jaccedu.201.08.001
Rožman, L., Lešer, V. J., Širca, N. T., Dermol, V., & Skrbinjek, V. (2014, June). Assessing student workload [Paper presentation]. Management, Knowledge and Learning International Conference, Portorož, Slovenia.
Ruggeri, K., Díaz, C., Kelley, K., Papousek, I., Dempster, M., & Hanna, D. (2008). International issues in education. Psychology Teaching Review, 14(2), 65–74.
Rummler, K. (Ed.). (2014). Medien in der Wissenschaft: Bd. 67. Lernräume gestalten – Bildungskontexte vielfältig denken. Waxmann. https://doi.org/31423
Rumsey, D. J. (2002). Statistical literacy as a goal for introductory statistics courses. Journal of Statistics Education, 10(3). https://doi.org/10.1080/10691898.2002.11910678
Rüth, M., Breuer, J., Zimmermann, D., & Kaspar, K. (2021). The effects of different feedback types on learning with mobile quiz apps. Frontiers in Psychology, 12, 665144. https://doi.org/10.3389/fpsyg.2021.665144
Rutkowski, D., & Wild, J. (2015). Stakes matter: student motivation and the validity of student assessments for teacher evaluation. Educational Assessment, 20(3), 165–179. https://doi.org/10.1080/10627197.2015.1059273
Ryan, M., & Reid, S. (2015). Impact of the flipped classroom on student performance and retention: a parallel controlled study in general chemistry. Journal of Chemical Education, 93(1), 13–23. https://doi.org/10.1021/acs.jchemed.5b00717
Ryan, R. M., & Deci, E. L. (2016). Facilitating and Hindering Motivation, Learning, and Well-Being in Schools. In K. R. Wentzel & D. B. Miele (Eds.), Handbook of Motivation at School (2nd ed., pp. 96–119). Routledge.
Sabbag, A., Garfield, J., & Zieffler, A. (2018). Assessing statistical literacy and statistical reasoning: The REALI instrument. Statistics Education Research Journal, 17(2), 141–160. https://doi.org/10.52041/serj.v17i2.163
Salzmann, P. (2015). Lernen durch kollegiales Feedback: Die Sicht von Lehrpersonen und Schulleitungen in der Berufsbildung [= Learning by cooperative feedback. The perspective of teachers and school administrators in vocational education.] Münster, New York: Waxmann.
Sancho-Vinuesa, T., Escudero-Viladoms, N., & Masià, R. (2013). Continuous activity with immediate feedback: a good strategy to guarantee student engagement with the course. Open Learning: The Journal of Open, Distance and E-Learning, 28(1), 51–66. https://doi.org/10.1080/02680513.2013.776479
Sandoz, E. K., Butcher, G., & Protti, T. A. (2017). A preliminary examination of willingness and importance as moderators of the relationship between statistics anxiety and performance. Journal of Contextual Behavioral Science, 6(1), 47–52. https://doi.org/10.1016/j.jcbs.2017.02.002


Satorra, A., & Bentler, P. M. (2010). Ensuring positiveness of the scaled difference chi-square test statistic. Psychometrika, 75, 243–248.
Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychological Methods, 15(4), 352–367. https://doi.org/10.1037/a0020143
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037//1082-989X.7.2.147
Schau, C. (2003, August). Students’ attitudes: the “other” important outcome in statistics education [Paper presentation]. Joint Statistics Meetings, San Francisco, United States.
Schau, C., & Emmioglu, E. (2012). Do introductory statistics courses in the United States improve students’ attitudes? Statistics Education Research Journal, 10(1), 35–51.
Schau, C., Stevens, J., Dauphinee, T. L., & Del Vecchio, A. (1995). The development and validation of the survey of attitudes toward statistics. Educational and Psychological Measurement, 55(5), 868–875. https://doi.org/10.1177/0013164495055005022
Schiefele, U., & Csikszentmihalyi, M. (1995). Motivation and ability as factors in mathematics experience and achievement. Journal for Research in Mathematics Education, 26(2), 163–181.
Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57(1). https://doi.org/10.1037/a0018082
Schmidt, H., Wagener, S., Guus, S., Keemink, L., & van der Molen, H. (2015). On the Use and Misuse of Lectures in Higher Education. Health Professions Education, 1(1), 12–18. https://doi.org/10.1016/j.hpe.2015.11.010
Schmitt, T. A., & Sass, D. A. (2011). Rotation criteria and hypothesis testing for exploratory factor analysis: Implications for factor pattern loadings and interfactor correlations. Educational and Psychological Measurement, 71(1), 95–113. https://doi.org/10.1177/0013164410387348
Schrader, C., & Grassinger, R. (2021). Tell me that I can do it better. The effect of attributional feedback from a learning technology on achievement emotions and performance and the moderating role of individual adaptive reactions to errors. Computers & Education, 161. https://doi.org/10.1016/j.compedu.202.104028
Schram, C. M. (1996). A meta-analysis of gender differences in applied statistics achievement. Journal of Educational and Behavioral Statistics, 21(1), 55–7.
Schultz, D., Duffield, S., Rasmussen, S. C., & Wageman, J. (2014). Effects of the flipped classroom model on student performance for advanced placement high school chemistry students. Journal of Chemical Education, 91(9), 1334–1339. https://doi.org/10.1021/ed400868x
Schunk, D. H. (1983). Ability versus effort attributional feedback: Differential effects on self-efficacy and achievement. Journal of Educational Psychology, 75, 848–856.
Schunk, D. H. (1991). Self-Efficacy and academic motivation. Educational Psychologist, 26(3), 207–231. https://doi.org/10.1207/s15326985ep2603&4_2
Schunk, D. H. (1989). Self-efficacy and achievement behaviors. Educational Psychology Review, 1(3), 173–208. https://doi.org/10.1007/bf01320134
Schunk, D. H., & Ertmer, P. A. (2000). Self-Regulation and academic learning. In Handbook of Self-Regulation (pp. 631–649). Elsevier. https://doi.org/10.1016/b978-012109890-2/50048-2


Schunk, D. H., & Lilly, M. W. (1984). Sex differences in self-efficacy and attributions: Influence of performance feedback. The Journal of Early Adolescence, 4(3), 203–213. https://doi.org/10.1177/0272431684043004
Seegers, G., & Boekaerts, M. (1996). Gender-Related differences in self-referenced cognitions in relation to mathematics. Journal for Research in Mathematics Education, 27(2), 215–24. https://doi.org/10.5951/jresematheduc.27.2.0215
Seifried, J. (2003). Der Zusammenhang zwischen emotionalem, motivationalem und kognitivem Erleben in einer selbstorganisationsoffenen Lernumgebung – Eine prozessuale Analyse des subjektiven Erlebens im Rechnungswesenunterricht. In J. Buer & O. Zlatkin-Troitschanskaia (Eds.), Berufliche Bildung auf dem Prüfstand (pp. 207–227). Peter Lang.
Seifried, J., & Sembill, D. (2005). Emotionale Befindlichkeit in Lehr-Lern-Prozessen in der beruflichen Bildung. Zeitschrift für Pädagogik, 51(5), 656–672.
Self, S. (2013). Utilizing online tools to measure effort: Does it really improve student outcome? International Review of Economics Education, 14, 36–45. https://doi.org/10.1016/j.iree.2013.03.001
Semb, G. B., & Ellis, J. A. (1994). Knowledge taught in school: What is remembered? Review of Educational Research, 64, 253–286. https://doi.org/10.3102/00346543064002253
Sembill, D., Wuttke, E., Seifried, J., Egloffstein, M., & Rausch, A. (2008). Selbstorganisiertes Lernen in der beruflichen Bildung Abgrenzungen, Befunde und Konsequenzen. Bibliothek der Universität Konstanz. https://doi.org/68217
Seo, E. H. (2012). Cramming, active procrastination, and academic achievement. Social Behavior and Personality: An International Journal, 40(8), 1333–134. https://doi.org/10.2224/sbp.2012.4.8.1333
Sesé, A., Jiménez, R., Montaño, J. J., & Palmer, A. (2015). Can attitudes toward statistics and statistics anxiety explain students’ performance? Revista De Psicodidactica / Journal of Psychodidactics, 20(2), 285–304. https://doi.org/10.1387/RevPsicodidact.13080
Shahirah, S., & Moi, N. (2019). Investigating the validity and reliability of survey attitude towards statistics instrument among rural secondary school students. International Journal of Educational Methodology, 5(4), 651–661. https://doi.org/10.12973/ijem.5.4.651
Shao, K., Pekrun, R., Marsh, H. W., & Loderer, K. (2020). Control-value appraisals, achievement emotions, and foreign language performance: A latent interaction analysis. Learning and Instruction, 69, 1–59.
Sharma, A. M., & Srivastav, A. (2021). Study to assess attitudes towards statistics of business school students: an application of the SATS-36 in India. International Journal of Instruction, 14(3), 207–222. https://doi.org/10.29333/iji.2021.14312a
Shi, D., Lee, T., Fairchild, A. J., & Maydeu-Olivares, A. (2019). Fitting ordinal factor analysis models with missing data: A comparison between pairwise deletion and multiple imputation. Educational and Psychological Measurement, 80(1), 41–66. https://doi.org/10.1177/0013164419845039
Shi, L., Cristea, A. I., Hadzidedic, S., & Dervishalidovic, N. (2014). Contextual gamification of social interaction – towards increasing motivation in social e-learning. In Advances


in Web-Based Learning – ICWL 2014 (pp. 116–122). Springer International Publishing. https://doi.org/10.1007/978-3-319-09635-3_12
Shinaberger, L. (2017). Components of a flipped classroom influencing student success in an undergraduate business statistics course. Journal of Statistics Education, 25(3), 122–13. https://doi.org/10.1080/10691898.2017.1381056
Shirvani, H. (2009). Examining an assessment strategy on High school mathematics achievement: Daily quizzes vs. weekly tests. American Secondary Education, 30(1), 34–45.
Showalter, D. A. (2021). Attitudinal changes in face-to-face and online statistical reasoning learning environments. Journal of Pedagogical Research, 5(2). https://doi.org/10.33902/JPR.2021269257
Shute, V. J. (2008). Focus on formative Feedback. Review of Educational Research, 78(1), 153–189.
Sireci, S., & Faulkner-Bond, M. (2014). Validity evidence based on test content. Psicothema, 26(1), 100–107. https://doi.org/10.7334/psicothema2013.256
Slootmaeckers, K., Kerremans, B., & Adriaensen, J. (2014). Too afraid to learn: Attitudes towards statistics as a barrier to learning statistics and to acquiring quantitative skills. Politics, 34(2), 191–20. https://doi.org/10.1111/1467-9256.12042
Smart, J. C. (Ed.). (2010). Higher Education: Handbook of Theory and Research. Higher Education: Handbook of Theory and Research. Springer Netherlands. https://doi.org/10.1007/978-90-481-8598-6
Smith, T. (2017). Gamified modules for an introductory statistics course and their impact on attitudes and learning. Simulation & Gaming, 48(6), 832–854. https://doi.org/10.1177/1046878117731888
Smith, T. M. (2008). An investigation into student understanding of statistical hypothesis testing [College Park, Md.: University of Maryland]. http://hdl.handle.net/1903/8565
Smits, M. H., Boon, J., Sluijsmans, D. M., & van Gog, T. (2008). Content and timing of feedback in a web-based learning environment: effects on learning as a function of prior knowledge. Interactive Learning Environments, 16(2), 183–193. https://doi.org/10.1080/10494820701365952
Snyder, L. G., & Snyder, M. J. (2008). Teaching critical thinking and problem-solving skills. The Delta Pi Epsilon Journal, 2, 90–99.
So, C. (2010). Making Software Teams Effective: How Agile Practices Lead to Project Success Through Teamwork Mechanisms (1st ed.). Peter Lang GmbH, Internationaler Verlag der Wissenschaften.
Soe, H., Khobragade, S., Lwin, H., Htay, M., Than, N., Phyu, K., & Abas, A. (2021). Learning statistics: interprofessional survey of attitudes toward statistics using SATS-36. Dentistry and Medical Research, 9, 121–125.
Song, J., & Chung, Y. (2020). Reexamining the interaction between expectancy and task value in academic settings. Learning and Individual Differences, 78, 101839. https://doi.org/10.1016/j.lindif.202.101839
Spangler, G., Pekrun, R., Kramer, K., & Hofmann, H. (2010). Students’ emotions, physiological reactions, and coping in academic exams. Anxiety, Stress & Coping: An International Journal, 15(4), 413–432.
Spector, J. M. (Ed.). (2008). Handbook of research on educational communications and technology (3rd ed.). Erlbaum.


Spence, J. T. (Ed.). (1983). (A Series of books in psychology). Achievement and achievement motives: Psychological and sociological approaches. W.H. Freeman.
Stanisavljevic, D., Trajkovic, G., Marinkovic, J., Bukumiric, Z., Cirkovic, A., & Milic, N. (2014). Assessing attitudes towards statistics among medical students: Psychometric properties of the Serbian version of the Survey of Attitudes Towards Statistics (SATS). PloS One, 9(11), e112567. https://doi.org/10.1371/journal.pone.0112567
Starkey-Perret, R., Deledalle, A., Jeoffrion, C., & Rowe, C. (2018). Measuring the impact of teaching approaches on achievement-related emotions: The use of the Achievement Emotions Questionnaire. The British Journal of Educational Psychology, 88(3), 446–464. https://doi.org/10.1111/bjep.12193
Steele, C. (1988). The psychology of self-affirmation. Sustaining the integrity of the self. Advances in Experimental Social Psychology, 21, 261–302.
Steenkamp, J.-B. E. M., & Maydeu-Olivares, A. (2020). An updated paradigm for evaluating measurement invariance incorporating common method variance and its assessment. Journal of the Academy of Marketing Science, 49(1), 5–29. https://doi.org/10.1007/s11747-020-00745-z
Steenkamp, J.-B. E. M., & Baumgartner, H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25(1), 78–107. https://doi.org/10.1086/209528
Stern, J., Ferraro, K., & Mohnkern, J. (2017). Tools for teaching conceptual understanding, secondary: Designing lessons and assessments for deep learning. Corwin.
Stipek, D. J., & Gralinski, J. H. (1991). Gender differences in children’s achievement-related beliefs and emotional responses to success and failure in mathematics. Journal of Educational Psychology, 83(3), 361–371. https://doi.org/10.1037/0022-0663.83.3.361
Stone, A. (2006). A psychometric analysis of the statistics concept inventory (Doctoral dissertation). Retrieved from UMI. (3208004)
Stone, A., Allen, K., Rhoads, T. R., Murphy, T. J., Shehab, R. L., & Saha, C. (2003). The statistics concept inventory: A pilot study. In 33rd Annual Frontiers in Education, 2003. FIE 2003. IEEE. https://doi.org/10.1109/fie.2003.1263336
Street, S. E., Gilliland, K. O., McNeil, C., & Royal, K. (2015). The flipped classroom improved medical student performance and satisfaction in a pre-clinical physiology course. Medical Science Educator, 25(1), 35–43. https://doi.org/10.1007/s40670-014-0092-4
Suh, Y. (2015). The performance of maximum likelihood and weighted least square mean and variance adjusted estimators in testing differential item functioning with nonnormal trait distributions. Structural Equation Modeling: A Multidisciplinary Journal, 22(4), 568–580. https://doi.org/10.1080/10705511.2014.937669
Swann, W. B., Chang-Schneider, C., & Larsen McClarty, K. (2007). Do people’s self-views matter? Self-concept and self-esteem in everyday life. The American Psychologist, 62(2), 84–94. https://doi.org/10.1037/0003-066X.62.2.84
Swann, W. B., Griffin, J. J., Predmore, S. C., & Gaines, B. (1987). The cognitive–affective crossfire: When self-consistency confronts self-enhancement. Journal of Personality and Social Psychology, 52(5), 881–889. https://doi.org/10.1037/0022-3514.52.5.881
Sweller, J., Ayres, P., & Kalyuga, S. (2011). Cognitive load theory. Springer Science+Business Media.

Bibliography

Takase, M., Niitani, M., Imai, T., & Okada, M. (2019). Students’ perceptions of teaching factors that demotivate their learning in lectures and laboratory-based skills practice. International Journal of Nursing Sciences, 6, 414–420. https://doi.org/10.1016/j.ijnss.2019.08.001
Talsma, K., Schüz, B., & Norris, K. (2019). Miscalibration of self-efficacy and academic performance: Self-efficacy ≠ self-fulfilling prophecy. Learning and Individual Differences, 69, 182–195. https://doi.org/10.1016/j.lindif.2018.11.002
Talsma, K., Schüz, B., Schwarzer, R., & Norris, K. (2018). I believe, therefore I achieve (and vice versa): A meta-analytic cross-lagged panel analysis of self-efficacy and academic performance. Learning and Individual Differences, 61, 136–150. https://doi.org/10.1016/j.lindif.2017.11.015
Taylor, G., Jungert, T., Mageau, G. A., Schattke, K., Dedic, H., Rosenfield, S., & Koestner, R. (2014). A self-determination theory approach to predicting school achievement over time: the unique role of intrinsic motivation. Contemporary Educational Psychology, 39(4), 342–358. https://doi.org/10.1016/j.cedpsych.2014.08.002
Tempelaar, D. (2004). Statistical reasoning assessment: an analysis of the SRA instrument [Paper presentation]. ARTIST Roundtable Conference on Assessment in Statistics, Appleton.
Tempelaar, D. T., Gijselaers, W. J., & Schim van der Loeff, S. (2006). Puzzles in statistical reasoning. Journal of Statistics Education, 14(1).
Tempelaar, D., & van der Loeff, S. (2011, August). The development of students’ subject attitudes when taking a statistics course [Paper presentation]. 58th World Statistics Congress of the International Statistical Institute, Dublin, Ireland.
Tempelaar, T., Schim van der Loeff, S., Gijselaers, W., & Nijhuis, J. (2011). On subject variations in achievement motivations: A study in business subjects. Research in Higher Education, 52, 395–419.
Tempelaar, D. T., Gijselaers, W. H., van der Schim Loeff, S., & Nijhuis, J. F. (2007). A structural equation model analyzing the relationship of student achievement motivations and personality factors in a range of academic subject-matter areas. Contemporary Educational Psychology, 32(1), 105–131. https://doi.org/10.1016/j.cedpsych.2006.10.004
Tempelaar, D. T., Niculescu, A., Rienties, B., Gijselaers, W. H., & Giesbers, B. (2012). How achievement emotions impact students’ decisions for online learning, and what precedes those emotions. The Internet and Higher Education, 15(3), 161–169. https://doi.org/10.1016/j.iheduc.2011.10.003
Tempelaar, D. T., Rienties, B., & Nguyen, Q. (2017). Towards actionable learning analytics using dispositions. IEEE Transactions on Learning Technologies, 10(1), 6–16. https://doi.org/10.1109/TLT.2017.2662679
Tempelaar, D. T., Van Der Loeff, S. S., & Gijselaers, W. H. (2007). A structural equation model analyzing the relationship of students’ attitudes toward statistics, prior reasoning abilities and course performance. Statistics Education Research Journal, 6(2), 78–102. https://doi.org/10.52041/serj.v6i2.486
Tempelaar, D. T., van der Schim Loeff, S., Gijselaers, W. H., & Nijhuis, J. F. H. (2011). On subject variations in achievement motivations: A study in business subjects. Research in Higher Education, 52(4), 395–419. https://doi.org/10.1007/s11162-010-9199-7

Tempelaar, D., Rienties, B., & Nguyen, Q. (2020). Subjective data, objective data and the role of bias in predictive modelling: Lessons from a dispositional learning analytics application. PloS One, 15(6). https://doi.org/10.1371/journal.pone.0233977
Thai, N. T. T., Wever, B. de, & Valcke, M. (2017). The impact of a flipped classroom design on learning performance in higher education: Looking for the best “blend” of lectures and guiding questions with feedback. Computers & Education, 107, 113–126. https://doi.org/10.1016/j.compedu.2017.01.003
Thai, N. T. T., Wever, B. de, & Valcke, M. (2020). Feedback: an important key in the online environment of a flipped classroom setting. Interactive Learning Environments, 1–14. https://doi.org/10.1080/10494820.2020.1815218
Timmers, C., & Veldkamp, B. (2011). Attention paid to feedback provided by a computer-based assessment for learning on information literacy. Computers & Education, 56(3), 923–930. https://doi.org/10.1016/j.compedu.2010.11.007
Tolboom, J., & Kuiper, W. (2013). How to utilize a classroom network to support teacher feedback in statistics education. In T. Plomp, & N. Nieveen (Eds.), Educational Design Research—Part B: Illustrative Cases (pp. 665–692). SLO.
Tolks, D., Schaefer, C., Raupach, R., Kurse, L., Sarikas, A. et al. (2016). An introduction to the inverted/flipped Classroom model in education and advanced training in medicine and in the healthcare professions. GMS J Med Educ, 33(3). https://doi.org/10.3205/zma001045
Tolli, A. P., & Schmidt, A. M. (2008). The role of feedback, causal attributions, and self-efficacy in goal revision. The Journal of Applied Psychology, 93(3), 692–701. https://doi.org/10.1037/0021-9010.93.3.692
Trautwein, U., Lüdtke, O., Marsh, H. W., Köller, O., & Baumert, J. (2006). Tracking, grading, and student motivation: Using group composition and status to predict self-concept and interest in ninth-grade mathematics. Journal of Educational Psychology, 98(4), 788–806. https://doi.org/10.1037/0022-0663.98.4.788
Trautwein, U., Marsh, H. W., Nagengast, B., Lüdtke, O., Nagy, G., & Jonkmann, K. (2012). Probing for the multiplicative term in modern expectancy–value theory: A latent interaction modeling study. Journal of Educational Psychology, 104(3), 763–777. https://doi.org/10.1037/a0027470
Trigwell, K., Ellis, R. A., & Han, F. (2012). Relations between students’ approaches to learning, experienced emotions and outcomes of learning. Studies in Higher Education, 37(7), 811–824. https://doi.org/10.1080/03075079.2010.549220
Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately. Journal of Second Language Writing, 16(4), 255–272. https://doi.org/10.1016/j.jslw.2007.06.003
Tze, V., Parker, P., & Sukovieff, A. (2021). Control-Value theory of achievement emotions and its relevance to school psychology. Canadian Journal of School Psychology, 37(1), 23–39. https://doi.org/10.1177/08295735211053962
Usher, E. L., & Pajares, F. (2008). Sources of self-efficacy in school: Critical review of the literature and future directions. Review of Educational Research, 78(4), 751–796. https://doi.org/10.3102/0034654308321456
Utts, J. (2003). What educated citizens should know about statistics and probability. The American Statistician, 57(2), 74–79. https://doi.org/10.1198/0003130031630

Valentine, J. C., DuBois, D. L., & Cooper, H. (2004). The relation between self-beliefs and academic achievement: A meta-analytic review. Educational Psychologist, 39(2), 111–133. https://doi.org/10.1207/s15326985ep3902_3
van Alten, D. C., Phielix, C., Janssen, J., & Kester, L. (2019). Effects of flipping the classroom on learning outcomes and satisfaction: A meta-analysis. Educational Research Review, 28. https://doi.org/10.1016/j.edurev.2019.05.003
van Appel, V., & Durandt, R. (2018). Dissimilarities in attitudes between students in service and mainstream courses towards statistics: An analysis conducted in a developing country. EURASIA Journal of Mathematics, Science and Technology Education, 14(8). https://doi.org/10.29333/ejmste/91912
van de Ridder, J. M. M., Peters, C. M. M., Stokking, K. M., Ru, J. A. de, & Cate, O. T. J. ten (2015). Framing of feedback impacts student’s satisfaction, self-efficacy and performance. Advances in Health Sciences Education: Theory and Practice, 20(3), 803–816. https://doi.org/10.1007/s10459-014-9567-8
Vancouver, J. B., & Kendall, L. N. (2006). When self-efficacy negatively relates to motivation and performance in a learning context. The Journal of Applied Psychology, 91(5), 1146–1153. https://doi.org/10.1037/0021-9010.91.5.1146
Vanhoof, S., Kuppens, S., Castro Sotos, A. E., Verschaffel, L., & Onghena, P. (2011). Measuring statistics attitudes: Structure of the survey of attitudes toward statistics (SATS-36). Statistics Education Research Journal, 10(1), 35–51. https://doi.org/10.52041/serj.v10i1.354
Värlander, S. (2008). The role of students’ emotions in formal feedback situations. Teaching in Higher Education, 13(2), 145–156. https://doi.org/10.1080/13562510801923195
Vaughan, N. (2007). Perspectives on blended learning in higher education. International Journal on E-Learning, 6(1), 81–94.
Vogel, S., & Schwabe, L. (2016). Learning and memory under stress: implications for the classroom. Npj Sci Learn, 1(16011). https://doi.org/10.1038/npjscilearn.2016.11
Walker, J. D., Cotner, S. H., Baepler, P. M., & Decker, M. D. (2008). A delicate balance: integrating active learning into a large lecture course. CBE—Life Sciences Education, 7, 361–367. https://doi.org/10.1187/cbe.08-02-0004
Wang, A. I., Zhu, M., & Saetre, R. (2016, November 7). NTNU Open: The Effect of Digitizing and Gamifying Quizzing in Classrooms. Retrieved October 2, 2022, from https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2426374
Wang, J., & Wang, X. (2012). Structural equation modeling: Applications using Mplus. John Wiley & Sons.
Wang, S.-L., & Wu, P.-Y. (2008). The role of feedback and self-efficacy on web-based learning: The social cognitive perspective. Computers & Education, 51(4), 1589–1598. https://doi.org/10.1016/j.compedu.2008.03.004
Waples, J. A. (2016). Building emotional rapport with students in statistics courses. Scholarship of Teaching and Learning in Psychology, 2(4), 285–293. https://doi.org/10.1037/stl0000071
Ward, B. (2004). The best of both worlds: A hybrid statistics course. Journal of Statistics Education, 12(3).
Ward, P., & Walker, J. (2008). The influence of study methods and knowledge processing on academic success and long-term recall of anatomy learning by first-year veterinary students. Anat Sci Ed, 78, 68–74. https://doi.org/10.1002/ase.12

Waters, L. K., Martelli, T. A., Zakrajsek, T., & Popovich, P. M. (1988). Attitudes toward statistics: An evaluation of multiple measures. Educational and Psychological Measurement, 48(2), 513–516. https://doi.org/10.1177/0013164488482026
Watson, J. M., & Moritz, J. B. (2000). The longitudinal development of understanding of average. Mathematical Thinking and Learning, 2(1–2), 11–50. https://doi.org/10.1207/s15327833mtl0202_2
Weiber, R., & Mühlhaus, D. (2014). Strukturgleichungsmodellierung: Eine anwendungsorientierte Einführung in die Kausalanalyse mit Hilfe von AMOS, SmartPLS und SPSS (2nd Ed.). Springer-Lehrbuch. Springer Gabler. https://doi.org/10.1007/978-3-642-35012-2
Weiber, R., & Sarstedt, M. (2021). Anwendungsprobleme der Kausalanalyse und Lösungsansätze. In Strukturgleichungsmodellierung (pp. 395–454). Springer Fachmedien Wiesbaden. https://doi.org/10.1007/978-3-658-32660-9_17
Weidinger, A. F., Spinath, B., & Steinmayr, R. (2016). Why does intrinsic motivation decline following negative feedback? The mediating role of ability self-concept and its moderation by goal orientations. Learning and Individual Differences, 47, 117–128. https://doi.org/10.1016/j.lindif.2016.01.003
Weidlich, J., & Spannagel, C. (2014). Die Vorbereitungsphase im Flipped Classroom. Vorlesungsvideos versus Aufgaben. In K. Rummler (Ed.), Medien in der Wissenschaft: Bd. 67. Lernräume gestalten—Bildungskontexte vielfältig denken (pp. 237–248). Waxmann.
Weinert, F. (2001). Vergleichende Leistungsmessung in Schulen—eine umstrittene Selbstverständlichkeit. In F. Weinert (Ed.), Leistungsmessungen in Schulen (pp. 17–31). Beltz.
Weinert, S., Artelt, C., Prenzel, M., Senkbeil, M., Ehmke, T., & Carstensen, C. H. (2011). 5 Development of competencies across the life span. Zeitschrift Für Erziehungswissenschaft, 14(S2), 67–86. https://doi.org/10.1007/s11618-011-0182-7
Wells, C. S. (2021). Assessing Measurement Invariance for Applied Research. Cambridge University Press.
Whitaker, D., Unfried, A., & Bond, M. (2022). Challenges associated with measuring attitudes using the SATS family of instruments. Statistics Education Research Journal, 21(1), 4. https://doi.org/10.52041/serj.v21i1.88
Wieling, M., & Hofman, W. (2010). The impact of online video lecture recordings and automated feedback on student performance. Computers & Education, 54, 992–998. https://doi.org/10.1016/j.compedu.2009.10.002
Wigfield, A., & Eccles, J. S. (2000). Expectancy-value theory of achievement motivation. Contemporary Educational Psychology, 25(1), 68–81. https://doi.org/10.1006/ceps.1999.1015
Wigfield, A., & Eccles, J. S. (2002). Development of achievement motivation. Academic Press.
Wigfield, A. (1994). Expectancy-value theory of achievement motivation: A developmental perspective. Educational Psychology Review, 6(1), 49–78. https://doi.org/10.1007/bf02209024
Wigfield, A., & Cambria, J. (2010). Students’ achievement values, goal orientations, and interest: Definitions, development, and relations to achievement outcomes. Developmental Review, 30(1), 1–35. https://doi.org/10.1016/j.dr.2009.12.001
Wigfield, A., & Eccles, J. S. (2002). The development of competence beliefs, expectancies for success, and achievement values from childhood through adolescence. In Development of achievement motivation (pp. 91–120). Elsevier. https://doi.org/10.1016/b978-012750053-9/50006-1

Wilbers, K. (Ed.). (2011). Texte zur Wirtschaftspädagogik und Personalentwicklung: Vol. 5. Die Wirtschaftsschule: Verdienste und Entwicklungsperspektiven einer bayerischen Schulart. Shaker.
Wild, C. J., & Pfannkuch, M. (1999). Statistical thinking in empirical enquiry. International Statistical Review / Revue Internationale De Statistique, 67(3), 223. https://doi.org/10.2307/1403699
Wilde, N., & Hsu, A. (2019). The influence of general self-efficacy on the interpretation of vicarious experience information within online learning. International Journal of Educational Technology in Higher Education, 16(1). https://doi.org/10.1186/s41239-019-0158-x
Winquist, J. R., & Carlson, K. A. (2014). Flipped statistics class results: Better performance than lecture over one year later. Journal of Statistics Education, 22(3). https://doi.org/10.1080/10691898.2014.11889717
Wise, S. L. (1985). The development and validation of a scale measuring attitudes toward statistics. Educational and Psychological Measurement, 45(2), 401–405. https://doi.org/10.1177/001316448504500226
Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17. https://doi.org/10.1207/s15326977ea1001_1
Wisenbaker, J. M., Scott, J. S., & Nasser, F. (1999). A cross-cultural comparison of path models relating attitudes about and achievement in introductory statistics courses [Paper presentation]. 52nd Session of the International Statistical Institute, Helsinki.
Wisniewski, B., Zierer, K., & Hattie, J. (2020). The power of feedback revisited: A meta-analysis of educational feedback research. Frontiers in Psychology, 10, 3087. https://doi.org/10.3389/fpsyg.2019.03087
Won Hur, J., & Anderson, A. (2013). iPad integration in an elementary classroom. In J. Keengwe (Ed.), Pedagogical Applications and Social Effects of Mobile Technology Integration (pp. 42–54). IGI Global.
Wortha, F., Azevedo, R., Taub, M., & Narciss, S. (2019). Multiple negative emotions during learning with digital learning environments—evidence on their detrimental effect on learning from two methodological approaches. Frontiers in Psychology, 10, 2678. https://doi.org/10.3389/fpsyg.2019.02678
Wright, J. D. (Ed.). (2015). International encyclopedia of the social & behavioral sciences. Elsevier.
Wu, Y., & Kang, X. (2021). A moderated mediation model of expectancy-value interactions, engagement, and foreign language performance. SAGE Open, 11(4), 215824402110591. https://doi.org/10.1177/21582440211059176
Xiao, J., & Bulut, O. (2020). Evaluating the performances of missing data handling methods in ability estimation from sparse data. Educational and Psychological Measurement, 80(5), 932–954. https://doi.org/10.1177/0013164420911136
Xu, C., & Schau, C. (2019). Exploring method effects in the six-factor structure of the survey of attitudes toward statistics (SATS-36). Statistics Education Research Journal, 18(2), 39–53. https://doi.org/10.52041/serj.v18i2.139
Xu, C., & Schau, C. (2021). Measuring statistics attitudes at the student and instructor levels: a multilevel construct validity study of the survey of attitudes toward statistics. Journal of Psychoeducational Assessment, 39(3), 315–331. https://doi.org/10.1177/0734282920971389
Xu, J. (2020). Longitudinal effects of homework expectancy, value, effort, and achievement: An empirical investigation. International Journal of Educational Research, 99, 101507. https://doi.org/10.1016/j.ijer.2019.101507
Yang, C., Luo, L., Vadillo, M. A., Yu, R., & Shanks, D. R. (2021). Testing (quizzing) boosts classroom learning: A systematic and meta-analytic review. Psychological Bulletin, 147(4), 399–435. https://doi.org/10.1037/bul0000309
Yeh, S. S. (2010). Understanding and addressing the achievement gap through individualized instruction and formative assessment. Assessment in Education: Principles, Policy & Practice, 17(2), 169–182. https://doi.org/10.1080/09695941003694466
Yeo, G. B., & Neal, A. (2006). An examination of the dynamic relationship between self-efficacy and performance across levels of analysis and levels of specificity. The Journal of Applied Psychology, 91(5), 1088–1101. https://doi.org/10.1037/0021-9010.91.5.1088
Yorganci, S. (2020). Implementing flipped learning approach based on ‘first principles of instruction’ in mathematics courses. Journal of Computer Assisted Learning, 36(5), 763–779. https://doi.org/10.1111/jcal.12448
Zainuddin, Z., Shujahat, M., Haruna, H., & Chu, S. K. W. (2020). The role of gamified e-quizzes on student learning and engagement: An interactive gamification solution for a formative assessment system. Computers & Education, 145, 103729. https://doi.org/10.1016/j.compedu.2019.103729
Zeidner, M. (1998). Test anxiety: The state of the art. Perspectives on individual differences. Plenum Press.
Zhang, D., Zhou, L., Briggs, R., & Nunamaker, J. (2006). Instructional video in e-learning: Assessing the impact of interactive video on learning effectiveness. Information & Management, 43, 15–27. https://doi.org/10.1016/j.im.2005.01.004
Zhang, Q., & Fiorella, L. (2019). Role of generated and provided visuals in supporting learning from scientific text. Contemporary Educational Psychology, 59. https://doi.org/10.1016/j.cedpsych.2019.101808
Zhang, D., Zhou, L., Briggs, R. O., & Nunamaker, J. F. (2006). Instructional video in e-learning: Assessing the impact of interactive video on learning effectiveness. Information & Management, 43(1), 15–27. https://doi.org/10.1016/j.im.2005.01.004
Zhu, H.-R., Zeng, H., Zhang, H., Zhang, H.-Y., Wan, F.-J., Guo, H.-H., et al. (2018). The preferred learning styles utilizing VARK among nursing students with bachelor degrees and associate degrees in China. Acta Paul Enferm, 31(2). https://doi.org/10.1590/1982-0194201800024
Zhu, Y., Zhang, J. H., Au, W., & Yates, G. (2020). University students’ online learning attitudes and continuous intention to undertake online courses: a self-regulated learning perspective. Educational Technology Research and Development, 68(3), 1485–1519. https://doi.org/10.1007/s11423-020-09753-w
Zieffler, A., Garfield, J., Alt, S., Dupuis, D., Holleque, K., & Chang, B. (2008). What does research suggest about the teaching and learning of introductory statistics at the college level? A review of the literature. Journal of Statistics Education, 16(2), 1–25.
Zieffler, A. S., & Garfield, J. B. (2009). Modeling the growth of students’ covariational reasoning during an introductory statistics course. Statistics Education Research Journal, 8(1), 7–31. https://doi.org/10.52041/serj.v8i1.455

Zimmerman, B. J. (2000). Self-efficacy: An essential motive to learn. Contemporary Educational Psychology, 25(1), 82–91. https://doi.org/10.1006/ceps.1999.1016
Zimmerman, M. A., Caldwell, C. H., & Bernat, D. H. (2002). Discrepancy between self-report and school-record grade point average: Correlates with psychosocial outcomes among African American adolescents. Journal of Applied Social Psychology, 32(1), 86–109. https://doi.org/10.1111/j.1559-1816.2002.tb01421.x
Zimmerman, B. J. (1989). A social cognitive view of self-regulated academic learning. Journal of Educational Psychology, 81(3), 329–339. https://doi.org/10.1037/0022-0663.81.3.329
Zimmerman, B. J. (2013). From cognitive modeling to self-regulation: A social cognitive career path. Educational Psychologist, 48(3), 135–147. https://doi.org/10.1080/00461520.2013.794676
Zimmerman, B. J., & Moylan, A. (2009). Self-regulation: Where metacognition and motivation intersect. In D. J. Hacker (Ed.), The educational psychology series. Handbook of metacognition in education (1st ed., pp. 299–315). Routledge.
Zingoni, M., & Byron, K. (2017). How beliefs about the self influence perceptions of negative feedback and subsequent effort and learning. Organizational Behavior and Human Decision Processes, 139, 50–62. https://doi.org/10.1016/j.obhdp.2017.01.007
Zlatkin-Troitschanskaia, O., Förster, M., Brückner, S., Hansen, M., & Happ, R. (2013). Modellierung und Erfassung der wirtschaftswissenschaftlichen Fachkompetenz bei Studierenden im deutschen Hochschulbereich. Lehrerbildung auf dem Prüfstand (Sonderheft), 6(1), 108–133.