International Handbook of Comparative Large-Scale Studies in Education: Perspectives, Methods and Findings (ISBN 3030881776, 9783030881771)

This handbook is the first of its kind to provide a general and comprehensive overview of virtually every aspect of International Large-Scale Assessments (ILSAs) in education.

English · 1517 [1518] pages · 2022

Table of contents:
Preface
Contents
About the Editors
Section Editors
Contributors
Part I: About This Handbook
1 Introduction to the Comparative Large-Scale Studies in Education: Structure and Overview of the Handbook
Part 1: Meta-Perspectives on International Large-Scale Assessments in Education
The Societal Role of International Large-Scale Assessments in Education
Theoretical Foundations of International Large-Scale Assessments in Education
Functions and Characteristics of International Large-Scale Assessments in Education
Accomplishments, Limitations, and Recommendations
Part 2: Methodology
Designing and Implementing ILSAs
Methods of Analysis
Potential and Methods of Linking ILSA to National Education Policy and Research
Part 3: Findings
Schools, Principals, and Institutions
Classrooms, Teachers, and Curricula
Students, Competences, and Dispositions
Equity and Diversity
2 Background, Aims, and Theories of the Comparative Large-Scale Studies in Education
Introduction
Aim
Themes and Content
The Theoretical Foundation and Perspectives of the Handbook
Educational Effectiveness
The Context-Input-Process-Output Model
The Hierarchical School System
Delineating the Scope
Perspectives and Principles Underlying the Handbook
Politically and Culturally Balanced
Methodological Perspectives
Extended Student Outcomes
Positioning the Handbook Within the Existing Literature
The Audience of This Handbook
Summary
References
Part II: Meta-perspectives on ILSAs: Theoretical Meta-perspectives on ILSAs
3 The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth
Overview
Understanding Growth
A Conceptual Framework for Knowledge and Growth
Growth Models with School Attainment
An Extended View of the Measurement of Human Capital
Knowledge Capital and Growth
Causality
The Gains from Universal Basic Skills
The Global Challenge
Economic Impacts of Universal Basic Skills
Conclusions
Cross-References
References
4 Reasons for Participation in International Large-Scale Assessments
Introduction
The Appeal of International Large-Scale Student Assessments
Creating a Demand for ILSAs
Conclusion
References
5 Educational Accountability and the Role of International Large-Scale Assessments
Introduction
Evidence-Based Educational Policymaking and Outcomes-Based Accountability
A Brief Overview of the History and Purpose of ILSAs
Administrative Accountability and the Role of ILSAs
A Framework for Considering the Potential for ILSAs in Accountability Systems
Research on Administrative Accountability and the Role of ILSAs
Conclusion
References
6 International Large-Scale Assessments and Education System Reform
Introduction
The Rise of International Large-Scale Assessments
The Power of Numbers
Changing Cognition and Behavior
Changing Behaviour
Penetrating Schools and Classrooms: PISA for Schools
An Alternative Model: The OECD's SEG Framework
Case Studies
Learning Seminars
A Holistic Approach to International Comparisons?
Conclusions
References
7 The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries
Chapter Summary
Background
Evidence-Based Policy-Making
Factors Influencing the Use of Evidence in Policy-Making
A Special Case of ILSAs: Two Regional Large-Scale Assessment Programs
Pacific Islands Literacy and Numeracy Assessment (PILNA)
A Brief Overview of PILNA
How Is PILNA Integrated into the Policy-Making Process of Participating Countries?
What Is the Role of Capacity Building Activities and Technical Quality in PILNA?
Research Capacity and the Use of Data
Quality of the Assessment Program
What Does PILNA's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?
Southeast Asia Primary Learning Metrics (SEA-PLM)
A Brief Overview of SEA-PLM
What Is the Role of Capacity Building Activities and Technical Quality in SEA-PLM?
How Is SEA-PLM Integrated into the Policy-Making Process of Participating Countries?
What Does SEA-PLM's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?
Conclusion
References
Part III: Meta-perspectives on ILSAs: The Role of Theory in ILSAs
8 Comprehensive Frameworks of School Learning in ILSAs
Introduction
Models of School Learning and their Developments
The Shift from Input-Output to Input-Process-Output Paradigm
Extensions from the Socio-Ecological Perspective
Extensions Reflecting the Dynamic Perspectives
Models of School Learning in ILSA Frameworks
The Beginning: Pilot Twelve-Country Study, Early ILSAs (FIMS, SIMS, IRLS), and IEA's Foundational Curriculum Model
Pilot Twelve-Country Study (1960)
FIMS: First International Mathematics Study (1964)
Six-Subject Study and FISS (First International Science Study; 1970-71)
SIMS: Second International Mathematics Study (1980-82) and SISS: Second International Science Study (1983-84)
Subsequent ILSAs: IEA's TIMSS and PIRLS and OECD's PISA and TALIS
TIMSS: Third International Mathematics and Science Study (1995) and Trends in International Mathematics and Science Study (4-Year Cycle from 1995 on)
IRLS: International Reading Literacy Study (1990-91) and PIRLS: Progress in International Reading Literacy Study (5-Year Cycle from 2001 on)
PISA: Programme for International Student Assessment (3-Year Cycle from 2000 on)
TALIS: Teaching and Learning International Survey (5-Year Cycle from 2008 on)
Early Childhood Education and Care (ECEC)
Preprimary Project (1987-1989, 1992, 1995-97) and the IEA ECES: Early Childhood Education Study (2015)
OECD Starting Strong Survey (2018)
Conclusions and Recommendations
References
9 Assessing Cognitive Outcomes of Schooling
Introduction
Conceptualization of Cognitive Outcomes
Similarities and Differences of Theoretical Frameworks Across Different ILSA Studies
Organization of Theoretical Frameworks
Comparison of Content Dimensions
Comparison of Cognitive Dimensions
Development of Cognitive Dimension Across ILSAs Over Time
Mathematics Assessment Frameworks
IEA Studies
PISA
Science Assessment Frameworks
IEA Studies
PISA
Reading Assessment Frameworks
IEA PIRLS
PISA
Implementation of the Frameworks in Item Design
Format of Assessment
Comparison of TIMSS and PISA Items' Features
New Forms of Assessment
Discussions and Conclusion
Summary of Findings
Implications
The Design of Future ILSAs
What Should a Country Be Achieving in Education?
Concluding Remarks
References
10 Socioeconomic Inequality in Achievement
Introduction
Distributive Rules
Different Goods, Different Rules
Socioeconomic Inequality: Implicit Assumptions
Measurement Issues
Indicators of Socioeconomic Status
Classification of Continuous and Categorical Measures
Standardization and Threshold Setting: International Comparability and National Specificity
Empirical Analyses
Data and Variables
Correlations Between the Different Measures of SES
Correlations Between the Different Measures of SES Inequality in Achievement
Standardization of Inequality Measures: Relative and Absolute Measures
Concluding Remarks
Cross-References
References
11 Measures of Opportunity to Learn Mathematics in PISA and TIMSS: Can We Be Sure that They Measure What They Are Supposed to Measure?
Research Problem
Previous Research on OTL Effects
Conceptual Framework
The Validity of OTL Measures
Research Approach and Research Questions
Method
Assessment of Content Validity
Assessment of Convergent Validity
Results
Content Validity: Comparison of the Content of the TIMSS and PISA OTL Measures
OTL Measures in TIMSS and PISA
Assessment/Analytical Frameworks
Similarities Between the Assessment/Analytical Frameworks and the OTL Measures
Key Findings: Interpretation of the Content Comparison with an Eye to Content Validity and Concurrent Validity
Interpretation of the Comparison in Terms of Face Validity
Convergent Validity: The Association of OTL with Student Performance in TIMSS and PISA
Data Analysis
Key Findings with Regard to Convergent Validity
Discussion
Conclusion
References
12 Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness
Introduction
A Brief History of International Large-Scale Assessment Studies in Education
ILSAs Conducted by the IEA
ILSAs Carried Out by the OECD Measuring Student Learning Outcomes
A Brief Historical Overview of the Educational Effectiveness Research
Connections Between ILSA and EER
The Dynamic Model of Educational Effectiveness: An Overview
Advancements of ILSAs by Making Use of the Dynamic Model
References
Part IV: Meta-perspectives on ILSAs: Characteristics of ILSAs
13 Overview of ILSAs and Aspects of Data Reuse
Introduction
Agencies, Studies and Cycles, and Participating Entities
ILSA Objectives
Domains of Investigation
Target Populations and Samples
Data Collection
ILSA International Databases and Aspects Related to Data Analysis
Data Reuse (Secondary Analyses)
Conclusion
References
14 IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement
IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement
Gradually Updating the TIMSS and PIRLS Frameworks
Role of the TIMSS and PIRLS Encyclopedias
Roles of Committees of International Experts
Role of the National Research Coordinators
Large Numbers of Items Continually Refreshed
The PIRLS 2021 Rotated Design and Number of Items
The TIMSS 2023 Rotated Design and Number of Items
Introducing Ambitious Innovations on a Small Scale
Less Difficult TIMSS and PIRLS Assessments
Transitioning to e-Assessment
Methodology for Linking Successive Assessments to a Common Scale
Item Calibration, Conditioning, and Plausible Values
Concurrent Calibration
Bridges When There Are Major Changes
TIMSS 2007 Bridge
TIMSS 2019 Bridge
Phase 1: Item Calibration for paperTIMSS and Bridge Data
Phase 2: Linking eTIMSS Data to TIMSS Trend Scales
Conclusion/Summary
References
15 IEA's Teacher Education and Development Study in Mathematics (TEDS-M)
Introduction
The Need for a Teacher ILSA
Structure of the Present Chapter
Conceptual Framework
Macro Level: Countries' National Context
Meso Level: Institutional Context
Microlevel: Future Teachers' Background
Teacher Education Outcomes: Teachers' Professional Knowledge
Teacher Education Outcomes: Teachers' Professional Beliefs
Research Questions
Study Design of TEDS-M
Sampling
Instruments: National (Macro Level) and Institutional Context (Meso Level)
Instruments: Future Teacher Surveys (Microlevel)
Data Analysis
Results
Macro Level: Structure and Quality Assurance of Teacher Education
Meso Level: Characteristics of Teacher Educators and Opportunities to Learn
Microlevel: Characteristics of Future Teachers (Based on Blömeke and Kaiser, 2014)
Outcomes of Teacher Education
Further Developments
Conclusions
References
Useful Resources
16 OECD Studies and the Case of PISA, PIAAC, and TALIS
Introduction
Programme for International Student Assessment (PISA)
Survey Objectives
Survey Periodicity and Country/Economy Participation
Survey Design, Methods, and Instruments
Assessment Framework
How PISA Differs from Other International Large-Scale Assessments
Key Insights to Date
Almost One in Four Students Did Not Reach a Minimum Proficiency Level in Reading on Average Across OECD Countries in 2018
Pulling Up Low Performers Without Affecting Overall Performance or Top Performers
PISA Results Help Identify Low Performers and Where They Are
Aiming for High Performance and Equity in Education Simultaneously
Allocation of Resources Matters More Than the Level of Spending After a Certain Investment Threshold
Identifying the Conditions That Support the Positive Relationship Between School Autonomy and Students' Performance
High-Performing Education Systems Hold High Expectations for Their Students
Students in High-Performing Education Systems Tend to Believe Their Abilities Can Be Developed Over Time
Future Developments
The Survey of Adult Skills (PIAAC)
Survey Objectives
Survey Periodicity and Country Participation
Survey Design and Methods
Study Instruments
Background Questionnaire
Direct Assessment
Assessment Frameworks
Relationship of PIAAC to Previous International Adult Assessments
Key Insights
Developments in the Second PIAAC Assessment
The Teaching and Learning International Survey (TALIS)
Survey Objectives
Survey Periodicity and Country/Economy Participation
Survey Design and Methods
Study Instruments
Survey Conceptual Framework
Relationship of TALIS to Other International Surveys
Key Insights
A Key Asset for Many Countries Is to Benefit from a Highly Educated and Altruistic Teacher Workforce
Yet, the Delivery of Quality Instruction Is Often Hindered by Shortages of Qualified Teachers, and Novice Teachers Tend to Be ...
This Underlines the Importance of Induction for a Flying Start in the Profession
But Exchanges and Collaboration with Peers Matter Beyond the Early Years and Throughout Teachers' Careers
The School and Classroom Climate and Culture also Matter
Teachers' Perceptions About Their Jobs Are Related to a Range of Different Factors
Appraisal Systems, if Well-Designed, Hold Great Potential
And School Leaders Have a Role to Play to Foster a Greater Sense of Job Satisfaction Among Teachers
Future Development Perspectives
Conclusion
References
17 Regional Studies in Non-Western Countries, and the Case of SACMEQ
Introduction
Main Aims, Frameworks, and Designs of ILSA Studies
Main Aims
Frameworks of ILSAs
Organizing Framework Proposed by the OECD
Conceptual Frameworks of the IEA
Designs
Frequency and Scale of Studies
Population and Sampling
SACMEQ Studies
Introduction
Aims and Objectives of SACMEQ
Governance, Management, and Coordination of SACMEQ
Framework of SACMEQ
Design of SACMEQ Studies I-IV
Population and Sampling
Instruments
Tests
Questionnaires
Data Collection and Analysis
Data Collection
Analysis
Questionnaire Analysis
Reflections and Conclusion
Cross-References
References
18 The Use of Video Capturing in International Large-Scale Assessment Studies: Methodological and Theoretical Considerations
Introduction
Issues and Challenges to Consider
Purpose of the Study
Dimensions of Teaching Practices Captured
Dimensions of Teaching Practices Captured: Looking Across Frameworks
Theoretical Underpinning: Views of Teaching and Learning
Subject Specificity
Grain Size
Scoring Specifications
Focus on Students or Teachers (or Both)
Empirical Evidence: Connecting Teaching with Student Outcomes
Ethics
Technology: A New Generation of Video Studies
Moving Forward
Teacher Evaluation
Video Documentaries As Longitudinal Data
Videos As a Means of Improving Teachers' Professional Learning
Concluding Remarks
References
19 Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings
Introduction
Prelude: Covering 48 years in Mathematics Education
Comparing and Finding Common Grounds in Design and Conceptualization
Country Coverage
Sampling
Domain of Assessment and Further Skills Covered
Context, Input, Process, and Output-Related Constructs
Comparing, Linking, and Matching Operational Measures
Differences in Instruments and Measures
Linking Measures from Different Studies
Searching for Universal Descriptors and Scales
Combining Data Across Assessments in Educational Research
Integrating Diverging and Converging Findings Across Studies
Matching Data
Longitudinal Designs
Analyzing ILSA Data on the Country Level
Comparing TIMSS and PISA Achievement Results on the Country Level
Comparing Country Level Trends Based on TIMSS and PISA
Conclusions and Implications for Further Research
References
Part V: Meta-perspectives on ILSAs: Accomplishments, Limitations, and Recommendations
20 ILSA in Arts Education: The Effect of Drama on Competences
Introduction
Historical, Pedagogical, and Epistemological Grounding
Concepts of Knowledge and Arts Education
The Field of Drama/Theatre Education
Drama as an Aesthetic Discipline
Drama as an Arts Education Subject
Learning About and In the Drama Subject: The Double Content
Learning With or Through Drama: Drama as a Learning Method in Other Subjects
Drama - and the Arts Hegemony in Schools
Applied Drama and Theatre
Competences Which Develop Through the Arts
An ILSA in Drama Education: The DICE Study
Motivation, Consortium, Objectives, and Hypothesis
An Introduction to the Methodology
The Sample
Data Collection
Eight Different Sources of Data
Overview of the Key Results
Discussion: How Arts Educators Assess Assessments
Lack of International Large-Scale Quantitative Assessments in the Field of Drama Education
Scepticism of Drama Researchers and Practitioners About Quantitative Measurements
Conclusion: Designing Large-Scale Assessment Studies in Drama Education
Appendix
References
21 Future Directions, Recommendations, and Potential Developments of ILSA
Preliminary Remarks
Introduction
History
Increased Participation
Future Directions
Recommendations
Increased Coverage
Study Frameworks
Changes in Curricula
Future Directions
Recommendations
Increased Depth: Added Value Through Computer-Based Assessment
Future Directions
Recommendations
Summary
References
22 Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations
ILSAs: Products of Their Time
A Brief Summary of ILSA History
Conceptual Accomplishments and Limitations
Boosting Theory Development in Education
Comprehensive Conceptual Modelling of Educational Outcomes
Including Educational Outcomes Beyond Cognition
Including Cognitive Outcomes Beyond Mathematics, Science, and Reading
Including Educational Outcomes of Populations Beyond School Students
Including Educational Outcomes of Low- and Middle-Income Countries
Describing Trends in Educational Outcomes and Boosting Longitudinal Studies
Modelling Predictors of Educational Outcomes
Limited Theoretical Foundation Jeopardizing Validity
Definition and Operationalization of Constructs
What Do We Want and Need to Measure?
How Do We Want to Compare the Results?
Comparability of Items and Constructs Across Countries
Comparability of Target Populations Across Countries
What Is Driving the Implementation and Expansion of ILSAs?
Methodological Accomplishments and Limitations
Infrastructure and Researcher Training: Boosting Assessment Capacity
Strengthening Measurement Quality
Further Development of Technical Approaches
Implementing New Types of Assessments
Challenges to Ensure Measurement Quality in Other Respects
Limited Estimation Precision for Highest and Lowest Performing Countries
Limited Comparability Over Time Due to Assessment Changes
Limited Accounting for the Cluster Structure
Lack of Possibility to Draw Causal Inferences
Difficulties to Communicate Methodology and Study Limitations
Empirical Accomplishments and Limitations
Setting Education and Schooling Back on the Agenda
Making Sure That Public Education Has the Quality Needed
Policy, Media, and ILSAs: Sometimes an Unfortunate Combination
Summary and Conclusions
Summary
Conclusions
References
Part VI: Methodology: Designing and Implementing ILSAs
23 Sampling Design in ILSA
Introduction
Ensuring Cross-National Comparability
Inferring Population Parameters from Complex Samples
Sampling Variance in Complex Samples
Summary and Conclusions
Cross-References
References
24 Designing Measurement for All Students in ILSAs
Introduction
Expanding the Scope in ILSA Background Questionnaires
Expanding Achievement Estimation in ILSAs
Discussion
Conclusion
References
25 Implementing ILSAs
Introduction
Comparability, Validity, and Reliability of International Assessments
International Large-Scale Assessments Discussed in This Chapter
PISA
PIRLS
TIMSS
ICILS
ICCS
Identifying a Common Construct Domain
Process and Procedures of Defining and Operationalizing a Domain
Comparability and Cross-Cultural Validity
Other Sources of Incomparability in International Assessments
Supervision and Quality Assurance
The Role of Technology and Complexity
The Widening Participation of Middle-Income Countries
Validation of Innovative Constructs in International Assessment
Financial Literacy (PISA)
Problem Solving (PISA)
Problem Solving and Inquiry (TIMSS)
Civic Knowledge (ICCS)
Conclusion
References
26 Dilemmas in Developing Context Questionnaires for International Large-Scale Assessments
Introduction
Overview of the Questionnaire Development Process
Updating the TIMSS, PIRLS, and PISA questionnaires
Overarching Principles for Questionnaire Development
Questionnaire Development Dilemmas
Dilemma 1: Use of scales and indices versus stand-alone items to measure educational constructs
Dilemma 2: Updating the Surveys Versus Measuring Trends
Dilemma 3: Prioritizing predictors of achievement scores versus conceptualizing questionnaire items as stand-alone outcomes
Dilemma 4: Covering versus not covering topical policy topics
Dilemma 5: Including innovative item types versus all traditional items
Conclusions
References
27 Multistage Test Design Considerations in International Large-Scale Assessments of Educational Achievement
Introduction
Some Design Choices and a Review of Related Studies
Realizing the Potential of MST
Conclusion
Appendix A: Some Technical Details for SLRR19 and RLSR20
RLSR20
SLRR19
References
Part VII: Methodology: Methods of Analysis
28 Secondary Analysis of Large-scale Assessment Databases
Introduction
About Large-scale Assessments
Sample Size and Reporting Requirements in Large-scale Assessments
Uncertainty in Large-scale Assessments
What Are Plausible Values?
How to Work with Plausible Values?
How to Compute Measurement Variance?
What Are Sampling Weights?
How to Work with Sampling Weights?
How to Compute Sampling Variance?
Using Replicate Weights
Taylor Series Approximation
Why Do These Things Matter?
References
29 Methods of Causal Analysis with ILSA Data
Introduction
Causality
Correlation and Causation
The Randomized Experiment
An Example: The STAR Experiment
The Endogeneity Problem
Models of Causality
Endogeneity Bias and Other Threats to Valid Causal Inference
Analytical Techniques to Support Causal Inference from Observational Data
Fixed Effects Models
Fixed Effects Models: An Illustration of the Age Effect on Mathematics Achievement
Fixed Effects Models: Alternative Techniques for Conducting Analyses with Fixed Country Effects
Regression Discontinuity Designs
Regression Discontinuity Designs: An Empirical Example of the Schooling Effect
Instrumental Variable Techniques
IV Techniques: Correcting for Errors of Measurement in Independent Variables
IV Techniques: Estimating IV Models with SEM
IV Techniques: Estimating Fuzzy Regression
Conditioning on Observed and Latent Variables
Discussion and Conclusions
Criticisms of Causal Approaches in ILSA Research
How to Deal with the Endogeneity Problem?
References
30 Trend Analysis with International Large-Scale Assessments
Introduction
Brief Background and Design of TIMSS, PIRLS, and PISA
How Are Trends Reported for TIMSS and PIRLS?
How Are Trends Reported for PISA?
Reporting of Trend Results and Policy Reactions
Current Issues in Trend Reporting
Future Directions and Opportunities
References
31 Cross-Cultural Comparability of Latent Constructs in ILSAs
The Framework of Bias and Equivalence
Taxonomy of Bias
Taxonomy of Equivalence
Review of the Practice of Equivalence Testing in ILSAs
General Remarks
Scale Reliability by Country
PCA by Country
Comparison of Scales' Correlations by Country
CFA Model Fit for Each Country
MGCFA
IRT Item Fit
Model-Based Approaches for Partial Measurement Invariance
The Expanding Toolbox of Equivalence Testing
Conclusions and Recommendations
References
32 Analyzing International Large-Scale Assessment Data with a Hierarchical Approach
Introduction
Aims of This Chapter
Systematic Review of Studies Taking a Hierarchical Approach to Analyzing ILSA Data
Context and Aims of the Systematic Review
Literature Search
Screening and Coding of Studies
Description of the Selected Studies
Publication Features
Data Structures
Aggregation of Variables and Construct Representation
Centering of Predictor Variables
Model Estimation and Evaluation
Handling Missing Data
Review of Selected Multilevel Modeling Approaches
Brief Overview of Existing Frameworks
Multilevel Regression Models
The Null Model
The Hierarchical Linear Model
Contextual Effects Models
Cross-Level Interaction Models
Limitations of Multilevel Regression Analysis
Multilevel Structural Equation Models
Latent Decomposition of Observed Variables
Latent Covariate Contextual Models
Multilevel Mediation Models
Multilevel Confirmatory Factor Analysis
Structural Models in MSEM
Multilevel Mixture Models
Issues Specific to the Multilevel Modeling of ILSA Data
Plausible Values
Multigroup and Incidental Multilevel Data Structures
Survey Weights
Illustrative Examples
Data Sources
Results
Multilevel Regression Models
Multilevel Confirmatory Factor Analysis
Multilevel Structural Equation Models
Multilevel Latent Profile Analysis
Conclusion
Appendices
Appendix A: Database Search Terms
Appendix B: PRISMA Statement
Appendix C: Description of the Variables Used in the Illustrative Examples
References
33 Process Data Analysis in ILSAs
Introduction
What Are Process Data?
What Kind of Information Can We Get from Process Data?
How Are Process Data Related to Response Process?
An Ecological Framework for the Analysis of Process Data
How Could the Proposed Ecological Framework Help the Analysis of Process Data from ILSAs?
Method
What Do We Know About the Analysis of Process Data from ILSAs?
Which Studies with Process Data from ILSAs Are Analyzed Here?
Results
When Did the First Studies Start to Be Published?
What Approaches and Strategies Were Used in the Analyses?
Item-Level Analysis (Layer 1)
Item-Level Analysis and Group-Level Characteristics (Layers 1, 5; Layers 1, 6)
Group of Items/Test-Level Analysis (Layer 2)
Group of Items/Test-Level Analysis and Personal Characteristics (Layers 2, 3)
Group of Items/Test-Level Analysis and Group-Level Characteristics (Layers 2, 3, 6)
Conclusion
The Potential and Limitations of Process Data in International Large-Scale Assessments
How Could the Ecological Framework Contribute to the Advancement of the Field?
What Is There to Come?
References
Part VIII: Methodology: Potential and Methods of Linking ILSA to National Education Policy and Research
34 Extending the ILSA Study Design to a Longitudinal Design
Introduction
Findings on the Impact of Tracking
Tracking in Czech Lower Secondary Education
Design of the Czech Longitudinal Study in Education
CLoSE Study Sampling and Data Collection
Measures of Achievement and Questionnaires
Mathematics Test
Czech Language
Reading Comprehension
Learning to Learn Competence Assessment
Questionnaires Used in Grades 5, 6, and 9
The Study of the Transition from Primary School to the Long Academic Track
Research Questions
Data and Methods
Results
Discussion of Findings on the Transition to the Long Academic Track
The Study of the Effects of the Long Academic Track on Students' Achievement
Research Questions
Data and Methods
Results
Discussion of the Effects of the Long Academic Track on Student Progress
Conclusion/Summary: Pros and Cons of Using an Extension of TIMSS & PIRLS for National Purposes
References
35 Extending International Large-Scale Assessment Instruments to National Needs: Germany's Approach to TIMSS and PIRLS
Introduction
Adaption of Survey Instruments
Example of Rewording to Ensure Cultural Appropriateness
Teacher Education (Teacher Questionnaire)
Extending Questionnaires
Example I: Describing Social Heterogeneity in Society by Looking at Relevant Subgroups (Extending the Home Questionnaire)
Example II: Addressing the Multidimensional Outcomes of Teaching and Learning (Extension of Student Questionnaire)
Example III: Investigating the Quality of Instruction (Extension of Student Questionnaire)
Example IV: Capturing Education Reform Programs (Extending the School Questionnaire)
Adapting or Extending Survey Documentation
Furthering Research: Perspectives that Will Answer Future Questions
Additional Longitudinal Component
Adding a Mixed-Method Component
Conclusion/Summary
References
36 Extending the Sample to Evaluate a National Bilingual Program
Introduction
Method
Participants
Instruments: Reading Comprehension Test
Predictor Variables of Academic Achievement
Procedure
Booklet Rotation Procedure
Data Analysis
Results
Conclusions
Appendix
References
37 A Non-Western Perspective on ILSAs
Introduction
Cultural and Educational Background
History of Education in the Gulf States
Societal Structure
Overview on the Educational Systems in the Gulf States
Results from Gulf States' Participation in ILSAs
Participation in ILSAs
Results from ILSAs
Overall Performance on Primary and Secondary Level
Performance in Relation to Gender (TIMSS and PIRLS)
Performance in Relation to Nationality Status (TIMSS)
Educational Aspirations
Impact and Use of ILSAs in the Region
Summary of the TIMSS and PIRLS Encyclopedias
Policymaking
Curriculum
School Feedback
Teacher Training
Awareness and Student Motivation
Implementing Educational Reforms in Relation to Student Achievement
Showcase: Impact of TIMSS and PIRLS on the Omani Education System
Conclusion
Cross-References
References
Part IX: Findings: Schools, Principals, and Institutions
38 A Systematic Review of Studies Investigating the Relationships Between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS
Introduction
Theoretical Framework
Multidimensionality of School Climate
The Aspects of School Climate
Academic
Community
Safety
Institutional Environment
School Climate and Student Outcomes
The Present Study
Methods
Search Procedures
Eligibility Criteria
Screening and Selection Process
Coding and Data Extraction
Findings
RQ1: How Is School Climate Assessed in ILSAs?
RQ2: What Characterizes the Studies Investigating the Relation Between School Climate Aspects and Student Outcomes Using ILSA ...
Aim of the Studies
Data and Samples
Methodological Appropriateness
RQ3: What Is the Pattern of Findings from the Studies Investigating the Relation Between School Climate Aspects and Student Ou...
Safety
Academic Climate
Community
Institutional Environment
Discussion and Implications
The Assessment of School Climate in ILSAs
A Systematic Review of ILSA-Related Research on the Relationships Between School Climate and Student Outcomes
Pattern of Findings from the Systematic Review of ILSA Studies
Conclusion
Cross-References
References
Part X: Findings: Classrooms, Teachers, and Curricula
39 Teaching Quality and Student Outcomes in TIMSS and PISA
Introduction
Theoretical Background
ILSAs and Teaching Quality: History, Challenges, and Affordances
Brief Overview of the History of Teaching Quality in TIMSS and PISA
Challenges and Affordances of Using TIMSS and PISA for Measuring Teaching Quality
The TBD Framework
Teaching Quality in TIMSS: A Systematic Review
Methods
Search Procedure
Screening Process
Coding and Data Extraction
Results
Assessment of Teaching Quality in TIMSS (RQ1)
TIMSS Frameworks
Questionnaires
Characteristics of Studies Utilizing TIMSS (RQ2)
Samples
Outcomes
Methods
Measures of TQ
Patterns of Findings
Summary and Discussion of the Systematic Review
Teaching Quality in PISA
Getting Started: Teaching Quality in PISA 2000-2009
Defining and Implementing Measures of Teaching
Early Findings and Publications on Teaching Quality in PISA 2000-2009
PISA 2000 (Reading)
PISA 2003 (Mathematics)
PISA 2006 (Science)
PISA 2009 (Reading)
Implementing a Coherent Approach to the Measurement of Teaching Across Domains: PISA 2012, 2015, and 2018
Defining and Implementing an Overarching Conceptualization of Teaching
Cutting Across Domains: Integrating Data on Teaching Quality from PISA 2012, 2015, and 2018
Research on Teaching Mathematics Based on PISA 2012
Research on Science Teaching Based on PISA 2015
Validating and Checking Measurement Quality for PISA Scales on Teaching Quality
Reliability
Measurement Invariance
Validity
Discussion
Discussion of Results
Frameworks
Secondary Analyses Studies
Relations to Student Outcomes
Limitations, Contributions, and Concluding Remarks
References
40 Inquiry in Science Education
Introduction
Theoretical Framework
Inquiry as an Instructional Approach and Outcome
The Assessment of Inquiry in TIMSS
Inquiry as an Instructional Approach
Inquiry as an Instructional Outcome
The Assessment of Inquiry in PISA
Inquiry as an Instructional Approach
Inquiry as an Instructional Outcome
The Present Study
Methods
Literature Search
Inclusion Criteria
Search and Screening Process
Findings
RQ1: The Main Characteristics of the Studies
Aim of the Studies
TIMSS and/or PISA Data Analyzed in the Studies
The Measurement and Analysis of Inquiry
RQ2: Contribution of the Studies to the Research on Inquiry in Science Education
Inquiry as an Instructional Approach
Inquiry as an Instructional Outcome
Inquiry as an Instructional Approach and Outcome
Discussion
Conclusion/Summary
Cross-References
References
41 Teacher Competence and Professional Development
Introduction
Theoretical Background on Teacher Competence
Conceptualizing Teacher Competence
Conceptualization of GPK in Large-Scale Assessments
Teachers' Situation-Specific Skills
Teacher Competence as a Predictor and an Outcome of Quality Education
Findings from ILSA on Teacher Competence
Measuring Teacher Competence in ILSA
Teacher Education
Affective-Motivational Facets of Teacher Competence
Teaching Experience
Professional Development
Associations between Teacher Characteristics and Student Learning Outcomes
Teacher Education
Affective-Motivational Facets of Teacher Competence
Professional Development
The Structure and Development of Teacher Competence with a Focus on Knowledge and Skills
Discussion
Summary
Outlook
References
42 Teachers' Beliefs
Introduction
Purpose of This Chapter
Background on Teachers´ Professional Beliefs
Teachers' Self-Efficacy
Teachers' Satisfaction
Professionalism
Teachers' Commitment
Leadership Beliefs
Data and Methods
Results
Research Proliferation: ILSA Studies on Teachers' Professional Beliefs
Research Publications: Conceptualization of Teachers' Professional Beliefs
Research Results: A Synthesis
Opportunities for Future Research
Conclusion
References
43 Homework: Facts and Fiction
Introduction
Homework Practices in Different Countries
Do Countries with More Homework Have Better Results?
Student Homework Behavior
Student Homework Time and Academic Achievement
Does the Time Spent on Homework Matter or Not?
The "How" Is More Important Than the "How Much"
The Role of the Teachers
Family Support with Homework
Conclusions and Future Perspectives
Summary of Key Findings and Some Conclusions
Future Research Directions
References
Part XI: Findings: Students, Competences, and Dispositions
44 International Achievement in Mathematics, Science, and Reading
Overview
TIMSS 2015
PIRLS 2016
TIMSS 2015 and PIRLS 2016 International Benchmarks
Primary School Achievement in Mathematics
Fourth-Grade Mathematics: Achievement Descriptions at International Benchmarks
Fourth-Grade Students Reaching the Low and High TIMSS International Benchmarks in Mathematics
Primary School Achievement in Science
Fourth-Grade Science: Achievement Descriptions at International Benchmarks
Fourth-Grade Students Reaching the Low and High TIMSS International Benchmarks in Science
Primary School Achievement in Reading
Fourth-Grade Reading: Achievement Descriptions at International Benchmarks
Fourth-Grade Students Reaching the Low and High PIRLS International Benchmarks in Reading
Secondary School Achievement in Mathematics
Eighth-Grade Mathematics: Achievement Descriptions at International Benchmarks
Eighth-Grade Students Reaching the Low and High TIMSS International Benchmarks in Mathematics
Secondary School Achievement in Science
Eighth-Grade Science: Achievement Descriptions at International Benchmarks
Eighth-Grade Students Reaching the Low and High TIMSS International Benchmarks in Science
Conclusion/Summary
References
45 Digital Competences: Computer and Information Literacy and Computational Thinking
Introduction
Background
ICT as a Cross-Curricular Learning Area
Two Domains of Digital Competence
Historical Background of Assessing Digital Competence
Assessing Digital Competence in Cross-National Contexts
The International Computer and Information Literacy Study (ICILS)
Survey Development and Design
The Measurement of Computer and Information Literacy (CIL)
The Measurement of Computational Thinking (CT)
The Measurement of Contextual Variables
Results and Interpretations
Results from ICILS 2018 Regarding Students' CIL and CT
The Context for Digital Learning
Explaining Variation in CIL and CT
Conclusion
Interpretations and Implications for Policy and Practice
References
46 Student Motivation and Self-Beliefs
Introduction
How Can Student Self-Beliefs and Motivation be Understood?
Assessment of Student Self-Beliefs and Motivation in ILSA Contexts
The ILSA Motivational Frameworks
Complex Constructs, Confusing Terminology?
Research on Motivation and Self-Beliefs Using ILSA Data
Are Motivation and Self-Beliefs Associated with Performance in ILSA Contexts?
Self-Beliefs and Motivation in Different Countries and Cultures: Can Findings be Validly Compared?
Other Group Differences in Levels of Motivation and Self-Beliefs
How Is ILSA Motivation Research Situated within the Larger Motivation Research Field?
Current Trends and Future Directions
Concluding Remarks
References
47 Well-Being in International Large-Scale Assessments
Introduction
What Is Child Well-being?
Criticism of Measuring Well-being in ILSAs
Including Well-being in ILSAs
Cognitive Dimension
Psychological Dimension
Social Dimension
Material: Economic Dimension
Physical Dimension
What Can Be Learnt from ILSAs About Children's Well-being?
What Can Be Learnt from ILSAs About Adults' Well-being?
Conclusions and Implications
References
Part XII: Findings: Equity and Diversity
48 Gender Differences in School Achievement
Introduction
Studying Gender Differences in Achievement in ILSAs
Research Questions and Methods of Inquiry
Describing Gender Differences in ILSAs
Gender Gaps in Reading
Gender Gaps in Reading in Fourth-Grade Students
Gender Gaps in Reading in 15-Year-Old Students
Gender Gaps in Mathematics
Gender Gaps in Mathematics in Fourth-Grade Students
Gender Gaps in Mathematics in Eighth-Grade Students
Gender Gaps in Mathematics in 15-Year-Old Students
Gender Gaps in Science
Gender Gaps in Science in Fourth-Grade Students
Gender Gaps in Science in Eighth-Grade Students
Gender Gaps in Science in 15-Year-Old Students
Gender Gaps in Civic and Citizenship
Gender Gaps in Computer and Information Literacy
Secondary Analyses on Gender Differences in ILSAs
Reading Achievement
Explanatory Variables at the Student Level
Explanatory Variables at the Teacher and School Level
Explanatory Variables at the Country Level
Studies on Gender Gaps in the Variation of Reading Scores
Mathematics Achievement
Explanatory Variables at the Student Level
Explanatory Variables at the Teacher and School Level
Explanatory Variables at the Country Level
Studies on Gender Gaps in the Variation of Mathematics Scores
Association Between Reading and Mathematics Gender Gaps
Science Achievement
Explanatory Variables at the Individual and Family Level
Explanatory Variables at the Teacher and School Level
Explanatory Variables at the Country Level
Studies on Gender Gaps in the Variation of Science Scores
Civic and Citizenship Achievement
Computer and Information Literacy
Discussion
Overall Results
Methods of Analysis of the Secondary Studies
Implications of the Literature Review for Future Research
Embedding the Findings in a Broader Societal and Historical Context
Conclusion
Appendix: Details of the Literature Search
References
49 Dispersion of Student Achievement and Classroom Composition
Introduction
Learning Environment and the Student Body
Why Do Schools and Classrooms Differ in the Composition of Their Student Body?
Why Do We Assume that the Composition of the School or Classroom Affects Students' Individual Development?
Operationalization of Student Composition
Overview of Findings on Achievement Composition
Descriptive Results on Achievement Dispersion
Effects of Achievement Dispersion and Level of a Group on Individual Student Achievement
Effects of Achievement Dispersion on Individual Student Achievement
Effects of Classroom or School Achievement Level on Individual Student Achievement
Differential Effects for High and Low Achievers
Summary of the Key Findings
Effects of Achievement Dispersion and Level of a Group on Individual Motivational and Psychosocial Outcomes of Students
Effects of Achievement Dispersion on Individual Motivational and Psychosocial Outcomes
Effects of Achievement Level on Individual Motivational and Psychosocial Outcomes
Differential Effects for High and Low Achievers
Summary of the Key Findings
Discussion
Limitations and Knowledge Gaps for Future Research
Criticisms of Research on Classroom and School Composition
Possible Ways of Addressing this Criticism
Future Directions
Composition of the Student Body: An Equity Perspective
Summary and Conclusion
References
50 Perspectives on Equity: Inputs Versus Outputs
Introduction
The Study of Equity and the Study of Equity with ILSAs
How to Address Equity: The Links Between Inputs and Outputs, and Between Equity and Excellence
An Analysis of Equity in Inputs and Outcomes for Students and Adults
Conclusions
Appendix A
Appendix B: Definition of Outcomes
PISA
PIAAC
References
51 Family Socioeconomic and Migration Background Mitigating Educational-Relevant Inequalities
Introduction
Data Collection Method
Keywords in the Literature Search
Selection Criteria and Screening Process
Measurement of Socioeconomic Status
Socioeconomic Status: Concept and Indicators Available
Operationalization of Socioeconomic Status in SES-Achievement Studies
The Function of Socioeconomic Status in Analyses
Analytical Methods Applied in the Studies
Effects of SES on Academic Achievement
SES Effect Based on Different Indicators
Effect of Cultural Capital on Individual Achievement
School Composition Effect on Student Outcome
Measurement of Family Migration
Migration Background: Concept and Indicators Available
Operationalization of Migration Background in SES-Achievement Studies
Functions of Migration Background
Effect of Migration on Achievement
Effect of Region of Origin
Effects of Migration at the School Level
Varying SES Effect on Migration Status
Summary of the Findings
Discussion Remarks
Appendix A
Literature Search Command in ProQuest
Search 1: SES and Migration = 101 Results
Search 2: Migration Only = 242 Results
Search 3: SES Only = 848 Results
Appendix B
Articles in the Final Round of Literature Review, Considered in the Chapter
Appendix C
Indicators of Family Socioeconomic Status in PISA, TIMSS, and PIRLS in each Cycle
References
Part XIII: Concluding Remarks
52 60 Years of ILSA: Where It Stands and How It Evolves
Meta-Perspectives on ILSAs in Education
Theoretical Meta-Perspectives: Educational Accountability and the Role of International Assessments
Reasons for Participation in ILSA
The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth
Educational Accountability and the Role of ILSAs in Economically Developed and Developing Countries
Theoretical Frameworks and Assessed Domains in ILSA
Comprehensive Frameworks in ILSA
Assessed Domains and Content Coverage
Populations and Design in ILSAs
Populations
Quantitative and Qualitative Approaches in ILSA
Methods in ILSA
Generalizability and Comparability of Results
Analytical Potential of ILSA Data for Causal Analysis
Trend Analysis with ILSA Data
Log-Data
Findings
Schools, Principals, and Institutions
Classrooms, Teachers, and Curricula
Students, Competences, and Dispositions
Equity and Diversity
Final Remarks
References
Index

Springer International Handbooks of Education

Trude Nilsen · Agnes Stancel-Piątak · Jan-Eric Gustafsson, Editors

International Handbook of Comparative Large-Scale Studies in Education Perspectives, Methods and Findings

Springer International Handbooks of Education

The Springer International Handbooks of Education series aims to provide easily accessible, practical, yet scholarly, sources of information about a broad range of topics and issues in education. Each Handbook follows the same pattern of examining in depth a field of educational theory, practice and applied scholarship, its scale and scope for its substantive contribution to our understanding of education and, in so doing, indicating the direction of future developments. The volumes in this series form a coherent whole due to an insistence on the synthesis of theory and good practice. The accessible style and the consistent illumination of theory by practice make the series very valuable to a broad spectrum of users. The volume editors represent the world's leading educationalists. Their task has been to identify the key areas in their field that are internationally generalizable and, in times of rapid change, of permanent interest to the scholar and practitioner.

Trude Nilsen • Agnes Stancel-Piątak • Jan-Eric Gustafsson, Editors

International Handbook of Comparative Large-Scale Studies in Education Perspectives, Methods and Findings

With 194 Figures and 114 Tables

Editors

Trude Nilsen
Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway

Agnes Stancel-Piątak
IEA Hamburg, Hamburg, Germany

Jan-Eric Gustafsson
University of Gothenburg, Gothenburg, Sweden

ISSN 2197-1951    ISSN 2197-196X (electronic)
Springer International Handbooks of Education
ISBN 978-3-030-88177-1    ISBN 978-3-030-88178-8 (eBook)
https://doi.org/10.1007/978-3-030-88178-8

© Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Some time ago, we were extremely pleased to see our mailboxes filling up with positive responses to our invitations to contribute chapters to the handbook. We knew we had aimed high and were hence thrilled to see that the general message of the feedback was that this is an excellent initiative and that such a handbook is sorely missed and highly needed. Almost all of the large number of distinguished researchers we invited accepted our invitation. Indeed, there was agreement that a handbook on international large-scale assessment (ILSA) is needed.

ILSAs have affected educational research, contributed to the development of methodology and theory, and had a large impact on educational policy. After about 60 years of international large-scale studies, and more than 25 years since the first of the current ILSAs (TIMSS 1995) was established, huge developments are evident. Enormous amounts of data have been produced and made freely available online. This accumulation of high-quality data and research calls for reviews that collect and summarize the extant evidence. Such reviews have been one of the main goals of the handbook.

At the same time, myths and legends tend to proliferate in the media and other outlets, and ILSAs have been made targets of both inaccurate and accurate critique. ILSAs depend heavily on criticism to improve, as they affect both research and policy. It is therefore important that the criticism is valid and that it reaches a broad audience of educational policymakers and stakeholders, students and beginning researchers, and researchers representing many different fields. Furthermore, researchers involved in ILSAs need knowledge of a large set of tools, for example to analyze data, make causal inferences, draw robust conclusions, and present findings. The handbook hence aims to gather and present information from several perspectives, including theory, history, policy, methodology, and ILSA findings.

Such a broad and all-encompassing handbook needed careful planning. We aimed for an inclusive handbook that would present different perspectives and mirror various research traditions. It was important for us to include both Western and non-Western perspectives, as well as voices from within the ILSA environments and critical voices alike. And we wanted to review findings from studies that have used ILSA data in all major fields. This was a large endeavor that demanded a lot of planning.

We would therefore like to express our gratitude to the section editor Prof. Sigrid Blömeke for her critique and fruitful suggestions that helped us reach our aim of a handbook embracing cultural diversity and multiple perspectives. Moreover, we would like to thank the section editor Prof. Ronny Scherer for his contributions to the development and his extensive help with the review process, and the section editors Prof. David Kaplan and Prof. Sarah Howie for their patience, insights, and support. We are very grateful to all authors who contributed to the book with their excellent research, their insightful thoughts and reflections, and their sophisticated, constructive criticism of ILSA. Finally, we would like to extend our special thanks to the reviewers, who conducted thorough reviews and provided extensive comments on the submitted chapters. Numerous reviewers worked on the handbook despite the pandemic and the difficulties everyone had to face during this time; their contribution to the publication is enormous. The different strengths and profiles of the editors have contributed to excellent collaborations and enhanced the quality of both the process and the product. We hope the readers will enjoy the insights and knowledge from the many authors of this handbook.

Oslo, Norway
Hamburg, Germany
Gothenburg, Sweden
August 2022

Trude Nilsen
Agnes Stancel-Piątak
Jan-Eric Gustafsson

Contents

Volume 1

Part I  About This Handbook . . . 1

1  Introduction to the Comparative Large-Scale Studies in Education: Structure and Overview of the Handbook . . . 3
   Sigrid Blömeke, Sarah J. Howie, David Kaplan, and Ronny Scherer

2  Background, Aims, and Theories of the Comparative Large-Scale Studies in Education . . . 13
   Trude Nilsen, Jan-Eric Gustafsson, and Agnes Stancel-Piątak

Part II  Meta-perspectives on ILSAs: Theoretical Meta-perspectives on ILSAs . . . 25

3  The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth . . . 27
   Eric A. Hanushek and Ludger Woessmann

4  Reasons for Participation in International Large-Scale Assessments . . . 55
   Ji Liu and Gita Steiner-Khamsi

5  Educational Accountability and the Role of International Large-Scale Assessments . . . 75
   Susanna Loeb and Erika Byun

6  International Large-Scale Assessments and Education System Reform . . . 97
   M. Ehren

7  The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries . . . 119
   Syeda Kashfee Ahmed, Michelle Belisle, Elizabeth Cassity, Tim Friedman, Petra Lietz, and Jeaniene Spink

Part III  Meta-perspectives on ILSAs: The Role of Theory in ILSAs . . . 143

8  Comprehensive Frameworks of School Learning in ILSAs . . . 145
   Agnes Stancel-Piątak and Knut Schwippert

9  Assessing Cognitive Outcomes of Schooling . . . 175
   Frederick Koon Shing Leung and Leisi Pei

10  Socioeconomic Inequality in Achievement . . . 201
    Rolf Strietholt and Andrés Strello

11  Measures of Opportunity to Learn Mathematics in PISA and TIMSS: Can We Be Sure that They Measure What They Are Supposed to Measure? . . . 221
    Hans Luyten and Jaap Scheerens

12  Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness . . . 253
    Leonidas Kyriakides, Charalambos Y. Charalambous, and Evi Charalambous

Part IV  Meta-perspectives on ILSAs: Characteristics of ILSAs . . . 277

13  Overview of ILSAs and Aspects of Data Reuse . . . 279
    Nathalie Mertes

14  IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement . . . 305
    Ina V. S. Mullis and Michael O. Martin

15  IEA's Teacher Education and Development Study in Mathematics (TEDS-M) . . . 325
    Sigrid Blömeke

16  OECD Studies and the Case of PISA, PIAAC, and TALIS . . . 379
    Andreas Schleicher, Miyako Ikeda, William Thorn, and Karine Tremblay

17  Regional Studies in Non-Western Countries, and the Case of SACMEQ . . . 419
    Sarah J. Howie

18  The Use of Video Capturing in International Large-Scale Assessment Studies: Methodological and Theoretical Considerations . . . 469
    Kirsti Klette

19  Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings . . . 511
    Eckhard Klieme

Part V  Meta-perspectives on ILSAs: Accomplishments, Limitations, and Recommendations . . . 545

20  ILSA in Arts Education: The Effect of Drama on Competences . . . 547
    Rikke Gürgens Gjærum, Adam Cziboly, and Stig A. Eriksson

21  Future Directions, Recommendations, and Potential Developments of ILSA . . . 579
    Dirk Hastedt and Heiko Sibberns

22  Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations . . . 603
    Sigrid Blömeke, Trude Nilsen, Rolf V. Olsen, and Jan-Eric Gustafsson

Volume 2

Part VI  Methodology: Designing and Implementing ILSAs . . . 657

23  Sampling Design in ILSA . . . 659
    Sabine Meinck and Caroline Vandenplas

24  Designing Measurement for All Students in ILSAs . . . 685
    David Rutkowski and Leslie Rutkowski

25  Implementing ILSAs . . . 701
    Juliette Lyons-Thomas, Kadriye Ercikan, Eugene Gonzalez, and Irwin Kirsch

26  Dilemmas in Developing Context Questionnaires for International Large-Scale Assessments . . . 721
    Martin Hooper

27  Multistage Test Design Considerations in International Large-Scale Assessments of Educational Achievement . . . 749
    Leslie Rutkowski, David Rutkowski, and Dubravka Svetina Valdivia

Part VII  Methodology: Methods of Analysis . . . 769

28  Secondary Analysis of Large-scale Assessment Databases . . . 771
    Eugenio J. Gonzalez

29  Methods of Causal Analysis with ILSA Data . . . 803
    Jan-Eric Gustafsson and Trude Nilsen

30  Trend Analysis with International Large-Scale Assessments . . . 831
    David Kaplan and Nina Jude

31  Cross-Cultural Comparability of Latent Constructs in ILSAs . . . 845
    Jia He, Janine Buchholz, and Jessica Fischer

32  Analyzing International Large-Scale Assessment Data with a Hierarchical Approach . . . 871
    Ronny Scherer

33  Process Data Analysis in ILSAs . . . 927
    Denise Reis Costa and Waldir Leoncio Netto

Part VIII  Methodology: Potential and Methods of Linking ILSA to National Education Policy and Research . . . 953

34  Extending the ILSA Study Design to a Longitudinal Design . . . 955
    David Greger, Jana Straková, and Patrícia Martinková

35  Extending International Large-Scale Assessment Instruments to National Needs: Germany's Approach to TIMSS and PIRLS . . . 979

36  Extending the Sample to Evaluate a National Bilingual Program . . . 995
    Francisco Javier García-Crespo, María del Carmen Tovar-Sánchez, and Ruth Martín-Escanilla

37  A Non-Western Perspective on ILSAs . . . 1021
    Oliver Neuschmidt, Zuwaina Saleh Issa Al-Maskari, and Clara Wilsher Beyer

Part IX  Findings: Schools, Principals, and Institutions . . . 1051

38  A Systematic Review of Studies Investigating the Relationships Between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS . . . 1053
    Trude Nilsen and Nani Teig

Part X  Findings: Classrooms, Teachers, and Curricula . . . 1087

39  Teaching Quality and Student Outcomes in TIMSS and PISA . . . 1089
    Eckhard Klieme and Trude Nilsen

40  Inquiry in Science Education . . . 1135
    Nani Teig

41  Teacher Competence and Professional Development . . . 1167
    Armin Jentsch and Johannes König

42  Teachers' Beliefs . . . 1185
    Heather E. Price

43  Homework: Facts and Fiction . . . 1209
    Rubén Fernández-Alonso and José Muñiz

Part XI  Findings: Students, Competences, and Dispositions . . . 1241

44  International Achievement in Mathematics, Science, and Reading . . . 1243
    Ina V. S. Mullis and Dana L. Kelly

45

Digital Competences: Computer and Information Literacy and Computational Thinking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273 Wolfram Schulz, Julian Fraillon, John Ainley, and Daniel Duckworth

46

Student Motivation and Self-Beliefs . . . . . . . . . . . . . . . . . . . . . . . . 1299 Hanna Eklöf

47

Well-Being in International Large-Scale Assessments . . . . . . . . . . 1323 Francesca Borgonovi

Part XII

Findings: Equity and diversity . . . . . . . . . . . . . . . . . . . . . .

1349

48

Gender Differences in School Achievement . . . . . . . . . . . . . . . . . . 1351 Monica Rosén, Isa Steinmann, and Inga Wernersson

49

Dispersion of Student Achievement and Classroom Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1399 Camilla Rjosk

50

Perspectives on Equity: Inputs Versus Outputs . . . . . . . . . . . . . . . 1433 Emma García

51

Family Socioeconomic and Migration Background Mitigating Educational-Relevant Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 1459 Victoria Rolfe and Kajsa Yang Hansen

Part XIII 52

Concluding Remarks

.............................

1493

60-Years of ILSA: Where It Stands and How It Evolves . . . . . . . . 1495 Agnes Stancel-Piątak, Trude Nilsen, and Jan-Eric Gustafsson

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1519

About the Editors

Trude Nilsen is a research professor in the Department of Teacher Education and School Research at the University of Oslo, Norway. She holds a PhD in science education, and her thesis was awarded the IEA Bruce H. Choppin Memorial Award. She is the leader of the research group Large-scale Educational Assessment (LEA) at her department and a leader of the large project Teachers’ Effect on Student Outcome (TESO). TESO is funded by the Norwegian Research Council and is a longitudinal extension of TIMSS. In addition, her work comprises analyzing data from the main ILSAs and supervising PhD students. Of all the ILSAs, TIMSS has been the main focus since 2010, and she is, and has been, an international external expert in the Questionnaire Item Review Committee (QIRC) for TIMSS 2019 and 2023. She has also worked as an external expert for the OECD in the Questionnaire Expert Group (QEG) for TALIS 2018 and the TALIS Starting Strong 2018. Nilsen has published a large number of articles in international and national journals as well as books, such as the award-winning and widely disseminated Springer book Teacher Quality, Instructional Quality and Student Outcome: Relationships Across Countries, Cohorts and Time. These publications primarily pertain to the areas of educational and teacher effectiveness. Using ILSA data, she focuses on school climate and teacher instructional quality as well as educational equality and applied methodology including causal inferences.


Agnes Stancel-Piątak is a Senior Researcher at the International Association for the Evaluation of Educational Achievement (IEA Hamburg, Germany). She holds a PhD in education. Her scientific work focuses primarily on social justice and educational effectiveness, on the one hand, and on study and test development and implementation, on the other. Moreover, in the context of large-scale assessments her interest is in overarching concepts describing the school system. Applied to ILSAs, these concepts contribute to enhancing the theoretical framing and to strengthening the alignment between different ILSAs. Agnes Stancel-Piątak acts as an expert for the IEA, the Organisation for Economic Co-operation and Development (OECD), and other international scientific organizations and national governmental entities in test construction and complex methods of data analysis. She was and is a member of the Questionnaire Expert Group (QEG) for TALIS 2018, the TALIS Starting Strong Survey 2018, and the upcoming TALIS 2024. She leads the scaling team within IEA Hamburg, which is responsible for the psychometric foundation of these studies.

Jan-Eric Gustafsson is professor emeritus of education at the University of Gothenburg. His research has primarily focused on basic and applied topics within the field of educational psychology. Three main categories of questions have been in focus: individual prerequisites for education; effects of resources for and organization of education; and educational outcomes at individual and system levels. The research on individual prerequisites for education has in particular focused on the structure of cognitive abilities, which has resulted in a hierarchical model that integrates several previous models of individual differences in cognitive abilities. The research on educational determinants and outcomes has primarily been based on data collected within the international large-scale assessments, and particularly so in the studies conducted by IEA. Gustafsson has also been involved in the development of quantitative methodology, focusing on issues of measurement with the Rasch model and on techniques for statistical analysis with latent variable models. He was for almost two decades a member of the IEA Technical Executive Group, which advises on design, analysis, and reporting of the IEA studies. To an increasing extent, he has also become involved in national educational policy issues and has chaired a governmental school commission aiming for improvements of the quality and equity of the Swedish school system. In 1993, he was elected member of the Royal Swedish Academy of Sciences, and in 2020 he was elected Corresponding Fellow of the British Academy.

Section Editors

Sigrid Blömeke University of Oslo, Oslo, Norway

Sarah Howie Africa Centre for Scholarship, Stellenbosch University, Stellenbosch, South Africa

David Kaplan University of Wisconsin–Madison, Madison, WI, USA


Ronny Scherer University of Oslo, Oslo, Norway

Contributors

Syeda Kashfee Ahmed EMR, ACER, Adelaide, Australia
John Ainley Australian Council for Educational Research, Camberwell, VIC, Australia
Michelle Belisle EMR, ACER, Adelaide, Australia
Clara Wilsher Beyer IEA Hamburg, Hamburg, Germany
Sigrid Blömeke University of Oslo, Oslo, Norway
Francesca Borgonovi University College London, London, UK
Janine Buchholz Leibniz Institute for Research and Information in Education, DIPF, Frankfurt, Germany
Erika Byun The Wharton School, Philadelphia, PA, USA
Elizabeth Cassity EMR, ACER, Adelaide, Australia
Charalambos Y. Charalambous Department of Education, University of Cyprus, Nicosia, Cyprus
Evi Charalambous Department of Education, University of Cyprus, Nicosia, Cyprus
Adam Cziboly Western Norway University of Applied Sciences, Bergen, Norway
Daniel Duckworth Australian Council for Educational Research, Camberwell, VIC, Australia
M. Ehren Vrije Universiteit Amsterdam FGB Boechorststraat, Amsterdam, The Netherlands; University College London, Institute of Education, London, UK
Hanna Eklöf Umeå University, Umeå, Sweden
Kadriye Ercikan Educational Testing Service, Princeton, NJ, USA


Stig A. Eriksson Western Norway University of Applied Sciences, Bergen, Norway
Rubén Fernández-Alonso Department of Education, Government of Principado de Asturias, Oviedo, Spain; University of Oviedo, Oviedo, Spain
Jessica Fischer German Institute for Adult Education, Leibniz Centre for Lifelong Learning, Bonn, Germany
Julian Fraillon Australian Council for Educational Research, Camberwell, VIC, Australia
Tim Friedman EMR, ACER, Adelaide, Australia
Emma García Economic Policy Institute and Center for Benefit-Cost Studies of Education, Teachers College, Columbia University, New York, NY, USA
Francisco Javier García-Crespo Universidad Complutense de Madrid, Madrid, Spain; Instituto Nacional de Evaluación Educativa, Madrid, Spain
Rikke Gürgens Gjærum UiT – The Arctic University of Norway, Harstad, Norway
Eugene Gonzalez Educational Testing Service, Princeton, NJ, USA
Eugenio J. Gonzalez ETS, Princeton, NJ, USA
David Greger IRDE, Faculty of Education, Charles University, Prague, Czechia
Jan-Eric Gustafsson University of Gothenburg, Gothenburg, Sweden
Eric A. Hanushek Stanford University, Stanford, CA, USA; CESifo, Munich, Germany; National Bureau of Economic Research, Cambridge, MA, USA; IZA, Bonn, Germany
Dirk Hastedt IEA, Amsterdam, Netherlands
Jia He Leibniz Institute for Research and Information in Education, DIPF, Frankfurt, Germany; Tilburg University, Tilburg, The Netherlands
Martin Hooper American Institutes for Research, Waltham, MA, USA
Sarah J. Howie Africa Centre for Scholarship, Stellenbosch University, Stellenbosch, South Africa
Miyako Ikeda OECD, Paris, France
Zuwaina Saleh Issa Al-Maskari Ministry of Education, Muscat, Oman


Armin Jentsch University of Hamburg, Faculty of Education, Hamburg, Germany
Nina Jude University of Heidelberg, Heidelberg, Germany
David Kaplan University of Wisconsin–Madison, Madison, WI, USA
Dana L. Kelly TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA, USA
Irwin Kirsch Educational Testing Service, Princeton, NJ, USA
Kirsti Klette University of Oslo, Oslo, Norway
Eckhard Klieme DIPF | Leibniz Institute for Educational Research and Information, Frankfurt am Main, Germany
Johannes König University of Cologne, Faculty of Human Sciences, Cologne, Germany
Leonidas Kyriakides Department of Education, University of Cyprus, Nicosia, Cyprus
Waldir Leoncio Netto Oslo Centre for Biostatistics and Epidemiology (OCBE), University of Oslo, Oslo, Norway
Frederick Koon Shing Leung The University of Hong Kong, Hong Kong, Hong Kong SAR; School of Mathematics and Statistics, Southwest University, Chongqing, China
Petra Lietz EMR, ACER, Adelaide, Australia
Ji Liu Faculty of Education, Shaanxi Normal University, Xi'an, People's Republic of China
Susanna Loeb Harvard Kennedy School, Cambridge, MA, USA
Hans Luyten University of Twente, Enschede, The Netherlands
Juliette Lyons-Thomas Educational Testing Service, Princeton, NJ, USA
Ruth Martín-Escanilla Instituto Nacional de Evaluación Educativa, Madrid, Spain
Michael O. Martin TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA, USA
Patrícia Martinková IRDE, Faculty of Education, Charles University, Prague, Czechia
Sabine Meinck IEA, Hamburg, Germany
Nathalie Mertes RandA, IEA, Hamburg, Germany
José Muñiz Nebrija University, Madrid, Spain


Ina V. S. Mullis TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA, USA
Oliver Neuschmidt International Studies, IEA Hamburg, Hamburg, Germany
Trude Nilsen Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway
Rolf V. Olsen University of Oslo, Oslo, Norway
Leisi Pei The University of Hong Kong, Hong Kong, Hong Kong SAR
Heather E. Price Loyola University Chicago, Chicago, IL, USA
Denise Reis Costa Centre for Educational Measurement (CEMO), University of Oslo, Oslo, Norway
Camilla Rjosk Institute for Educational Quality Improvement (IQB), Humboldt University Berlin, Berlin, Germany
Victoria Rolfe Department of Education and Special Education, Gothenburg University, Gothenburg, Sweden
Monica Rosén University of Gothenburg, Gothenburg, Sweden
David Rutkowski Indiana University, Bloomington, IN, USA; Centre for Educational Measurement, University of Oslo, Oslo, Norway
Leslie Rutkowski Indiana University, Bloomington, IN, USA; Centre for Educational Measurement, University of Oslo, Oslo, Norway
Jaap Scheerens University of Twente, Enschede, The Netherlands
Ronny Scherer Centre for Educational Measurement at the University of Oslo (CEMO), Faculty of Educational Sciences, University of Oslo, Oslo, Norway
Andreas Schleicher OECD, Paris, France
Wolfram Schulz Australian Council for Educational Research, Camberwell, VIC, Australia
Knut Schwippert Faculty for Educational Sciences, University of Hamburg, Hamburg, Germany
Heiko Sibberns IEA, Hamburg, Germany
Jeaniene Spink EMR, ACER, Adelaide, Australia
Agnes Stancel-Piątak IEA Hamburg, Hamburg, Germany
Gita Steiner-Khamsi Teachers College, Columbia University, New York, NY, USA
Isa Steinmann University of Oslo, Oslo, Norway


Jana Straková IRDE, Faculty of Education, Charles University, Prague, Czechia
Andrés Strello IEA, Hamburg, Germany
Rolf Strietholt IEA, Hamburg, Germany
Dubravka Svetina Valdivia Indiana University, Bloomington, IN, USA
Nani Teig Department of Teacher Education and School Research, University of Oslo, Blindern, Oslo, Norway
William Thorn OECD, Paris, France
María del Carmen Tovar-Sánchez Instituto Nacional de Evaluación Educativa, Madrid, Spain
Karine Tremblay OECD, Paris, France
Caroline Vandenplas B12 Consulting, Ottignies-Louvain-la-Neuve, Belgium
Heike Wendt Faculty of Environmental, Regional and Educational Sciences, University of Graz, Graz, Austria
Inga Wernersson University West, Trollhättan, Sweden
Ludger Woessmann University of Munich and ifo Institute, Munich, Germany; CESifo, Munich, Germany; IZA, Bonn, Germany
Kajsa Yang Hansen Department of Education and Special Education, Gothenburg University, Gothenburg, Sweden

Inga Wernersson: deceased.

Part I About This Handbook

1 Introduction to the Comparative Large-Scale Studies in Education: Structure and Overview of the Handbook

Sigrid Blömeke, Sarah J. Howie, David Kaplan, and Ronny Scherer

Contents
Part 1: Meta-Perspectives on International Large-Scale Assessments in Education . . . 4
The Societal Role of International Large-Scale Assessments in Education . . . 5
Theoretical Foundations of International Large-Scale Assessments in Education . . . 6
Functions and Characteristics of International Large-Scale Assessments in Education . . . 6
Accomplishments, Limitations, and Recommendations . . . 6
Part 2: Methodology . . . 8
Designing and Implementing ILSAs . . . 9
Methods of Analysis . . . 9
Potential and Methods of Linking ILSA to National Education Policy and Research . . . 9
Part 3: Findings . . . 10
Schools, Principals, and Institutions . . . 11
Classrooms, Teachers, and Curricula . . . 12
Students, Competences, and Dispositions . . . 12
Equity and Diversity . . . 12

S. Blömeke (*) University of Oslo, Oslo, Norway, e-mail: [email protected]
S. J. Howie Africa Centre for Scholarship, Stellenbosch University, Stellenbosch, South Africa, e-mail: [email protected]
D. Kaplan University of Wisconsin-Madison, Madison, WI, USA, e-mail: [email protected]
R. Scherer Centre for Educational Measurement at the University of Oslo (CEMO), Faculty of Educational Sciences, University of Oslo, Oslo, Norway, e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_1


Abstract

The International Handbook of Comparative Large-Scale Studies in Education presents theoretical and empirical perspectives on international large-scale assessments (ILSAs) in education, their methodology, and key findings in three parts. Each of these parts is divided into sections (see Fig. 1).

Keywords

International large-scale assessments · Theory · Methodology · Findings

Part 1: Meta-Perspectives on International Large-Scale Assessments in Education

Section editors: Sigrid Blömeke (University of Oslo, Norway) & Sarah Howie (University of Stellenbosch, South Africa)

International large-scale assessments (ILSAs) have demonstrated the power to influence educational policy and curricular reforms. It is hence important to be aware of their aims and functions, their theoretical foundations, and their methodological potential and limitations. A deeper understanding of the ILSAs may further be gained through examining the historical developments of the ILSAs and reflecting on the mutual influence between ILSAs on the one hand and society, policymaking, and research on the other. Against this background, the purpose of the first part of the International Handbook of Comparative Large-Scale Studies in Education is to provide an overview of the history, theory, and methodology of ILSAs as a modern, worldwide phenomenon, including an overview of the role theory has played within ILSAs. Furthermore, this part of the handbook discusses the accomplishments and limitations of ILSAs from a broad range of disciplinary approaches and provides an outlook on the future of ILSAs. This part of the international handbook thus adopts a meta-perspective on ILSAs, reflecting on their role in society, the economy, educational research, theory building, and educational policy.

Although ILSAs are currently a worldwide phenomenon, most of the literature about ILSAs is still of a descriptive or technical nature. Part 1 of the handbook therefore aims at moving beyond this state of research by providing illustrations of how different academic disciplines analyze and theorize the appearance and effects of ILSAs and evaluate their contributions and consequences. To meet the objective of providing a meta-perspective, the present part of the handbook draws on a diversity of disciplines and scholarly traditions by systematically including authors from educational history, sociology, and economics as well as from curriculum theory, educational policy, comparative education, psychometrics, educational measurement, assessment, and evaluation to gather new perspectives and insights into ILSAs.

Fig. 1 Overview of the structure of the International Handbook of Comparative Large-Scale Studies in Education

The Societal Role of International Large-Scale Assessments in Education

More than half of the countries worldwide are taking part in one ILSA or another. This applies not only to global enterprises such as TIMSS or PISA but also to regional enterprises such as SACMEQ (Southern and Eastern Africa Consortium for Monitoring Educational Quality) or LLECE (Laboratorio Latinoamericano de Evaluación de la Calidad de la Educación – Latin-American Laboratory for Assessment of the Quality of Education). This section provides an overview of the development of these studies and an understanding, from a global perspective, of why this movement took place, interrogating the political, economic, accountability, quality assurance, and reform motivations. In addition, the section will discuss how ILSAs are related to within-country developments of the school systems and the motives of countries or educational systems to take part in ILSAs despite, at times, local criticism and negative media coverage. Finally, while most of the research and discussions take place in highly developed countries, a particular chapter is devoted to the role of ILSAs and regional studies in developing countries and their link to evidence-based policymaking. To address the research questions of this section from plural perspectives, authors from a broad range of disciplinary approaches have been invited to take part in the discussion.


Theoretical Foundations of International Large-Scale Assessments in Education

The measures used in ILSAs are selected based on comprehensive frameworks that represent different models of school learning. These frameworks and models are most visible when assessing the role of opportunities to learn and classroom context characteristics for student achievement. This section presents the most common frameworks applied to ILSAs and explains why certain types of measures are included or not. In addition, the theoretical background of core measures used in the different ILSAs is summarized, such as cognitive outcomes of schooling in terms of achievement, affective-motivational outcomes in terms of motivation and interest, measures of student background, and opportunities to learn. Various chapters reflect upon the cross-pollination of theoretical contributions to and from ILSA research in education, psychology, and sociology. The section concludes with the contribution of the dynamic model of educational effectiveness to promoting quality and equity in education.

Functions and Characteristics of International Large-Scale Assessments in Education

In 1964, there was one ILSA in education; today, there are almost twenty. Why so many? What are their functions and their aims, what do they measure, and how? And what is the difference between these ILSAs? These questions are addressed in this section, which provides a thorough description of the functions and characteristics of ILSAs. A number of chapters focus on the aims, frameworks, and designs of the IEA and OECD studies as well as of larger regional studies. In addition to the studies focusing on student outcomes at the school level, studies on teacher education, teachers, and adult literacy are included. Furthermore, the opportunities and potential of using video technology in ILSAs are discussed. The challenge of assessing the outcomes of an aesthetic subject cross-nationally is also presented. Finally, the section ends with an examination of the design and conceptualization of studies and what these mean for study designs and instrument development.

Accomplishments, Limitations, and Recommendations

The concluding section of part 1 evaluates what ILSAs have achieved over the almost six decades they have been carried out, in addition to reflecting on what their limitations are. The comprehensive chapters in this section point to the strengths and vulnerabilities as well as to the progress and opportunities of ILSAs: The first chapter outlines the historical developments over time and provides projections for the future, making recommendations for each potential future direction.


The second chapter summarizes and then interrogates the conceptual, methodological, and empirical accomplishments and limitations of ILSAs, reflecting, among other things, on the lessons derived from the previous chapters. This final chapter thus summarizes and reflects on the meta-perspectives taken in part 1: ILSAs' societal role, their theoretical foundations, functions, and characteristics are thoroughly interrogated, balancing their benefits without losing sight of their challenges (Table 1).

Table 1 Overview of the handbook section "Meta-Perspectives on and Characteristics of International Large-Scale Assessments in Education"

Meta-Perspectives on International Large-Scale Assessments in Education
The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth (Eric A. Hanushek & Ludger Woessmann)
Reasons for Participation in International Large-Scale Assessments (Ji Liu & Gita Steiner-Khamsi)
Educational Accountability and the Role of International Large-Scale Assessments (Susanna Loeb & Erika Byun)
International Large-Scale Assessments and Education System Reform: On the Power of Numbers (Melanie Ehren)
The Role of ILSAs in Economically Developing Countries (Syeda Kashfee Ahmed, Michelle Belisle, Elizabeth Cassity, Tim Friedman, Petra Lietz & Jeaniene Spink)

Theoretical Foundations of International Large-Scale Assessments in Education
Comprehensive Frameworks of School Learning in ILSAs (Agnes Stancel-Piątak & Knut Schwippert)
Assessing Cognitive Outcomes of Schooling (Frederick Koon Shing Leung & Leisi Pei)
Socioeconomic Inequality in Achievement: Conceptual Foundations and Empirical Measurement (Rolf Strietholt & Andrés Strello)
Measures of Opportunity to Learn Mathematics in PISA and TIMSS: Can We Be Sure that They Measure What They Are Supposed to Measure? (Hans Luyten & Jaap Scheerens)
Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness (Leonidas Kyriakides, Charalambos Y. Charalambous, & Evi Charalambous)

Functions and Characteristics of International Large-Scale Assessments in Education
Overview of ILSAs and Aspects of Data Reuse (Nathalie Mertes)
IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement (Ina Mullis & Michael Martin)
IEA's Teacher Education and Development Study in Mathematics (TEDS-M): Framework and Findings from 17 Countries (Sigrid Blömeke)
OECD Studies and the Case of PISA, PIAAC, and TALIS (Andreas Schleicher, Miyako Ikeda, William Thorn, & Karine Tremblay)
Regional Studies in Non-Western Countries, and the Case of SACMEQ (Sarah Howie)
The Use of Video Capturing in International Large-Scale Assessment Studies: Methodological and Theoretical Considerations (Kirsti Klette)
The Case of Drama: ILSA Exemplified in Arts Education. What Learning Competences Can Be Developed Through Drama Education? (Rikke Gürgens Gjærum, Adam Cziboly, & Stig A. Eriksson)
Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-Validating Findings (Eckhard Klieme)

ILSAs in Education: Accomplishments, Limitations, and Recommendations
Future Directions and Recommendations, Potential Developments of ILSA (Dirk Hastedt & Heiko Sibberns)
ILSAs: Accomplishments, Limitations, and Controversies – A Summary (Sigrid Blömeke, Rolf Vegar Olsen, Trude Nilsen, & Jan-Eric Gustafsson)

Part 2: Methodology

Section editor: David Kaplan (University of Wisconsin-Madison, USA)

This part of the handbook focuses on the methodology of international large-scale assessments. ILSAs are, by their very nature, extremely complex, and these complexities result from the requirement that ILSAs provide reliable and valid measures of academic and nonacademic constructs that are representative of the student, teacher, and school populations of interest. Almost all ILSAs currently operating address these complexities via the formation of methodological expert groups that advise on the sampling design of the assessment as well as the development and evaluation of assessment instruments. In turn, decisions that are made regarding the methodology of ILSAs often inform innovative approaches to the secondary analyses needed to guide policymakers. Still, despite the use of state-of-the-art methodology in the design, implementation, and analysis of ILSAs, creative methods are sometimes needed to link the cross-system comparability of ILSAs to national policy needs. This part of the handbook consists of chapters organized around (a) the design and implementation of ILSAs, (b) traditional and innovative methods for the analysis of ILSAs, and (c) possible extensions of ILSAs to meet national policy needs.


Designing and Implementing ILSAs

This section focuses on sampling design issues that are common across the major ILSAs currently in operation, with a special focus on the elements of the sampling designs that need to be considered to ensure proper inferences. In addition to the sampling design, this section also addresses measurement issues associated with the academic and nonacademic constructs of interest. Importantly, this section also includes a discussion of controversies centered on the measurement of academic and nonacademic constructs for low-performing countries.

Methods of Analysis

Taking the design and implementation of the ILSAs as given, the next question concerns the analysis of the data. ILSAs not only inform policymakers but also serve as rich sources of data to advance basic research. Questions such as the cross-cultural comparability of academic and nonacademic constructs, the potential for causal analysis with ILSAs, the analysis of ILSA trend data obtained at the system level, the estimation of the effects of school-level policies and practices on student outcomes, and the analysis of available process data continue to drive empirical research in education. However, secondary analyses must take into account the complexities of the design, as outlined in the previous section. Thus, this section provides an overview of traditional and innovative approaches to the secondary analysis of ILSA data and offers the reader an insight into best methodological practices.
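One concrete example of such design complexities is that student proficiency in ILSAs is typically reported as several plausible values rather than a single score. The sketch below is a minimal, illustrative example (not taken from the handbook) of how an estimate computed once per plausible value can be combined using Rubin's combination rules; the numeric inputs are invented placeholders, and in practice the per-value sampling errors would come from the study's replicate weights.

```python
# Illustrative sketch: combining analysis results across plausible values (PVs)
# with Rubin's combination rules. The inputs below are invented placeholders.
import numpy as np

# Suppose a country mean was estimated once per plausible value (M = 5),
# each with its own sampling standard error.
estimates = np.array([498.2, 501.1, 499.4, 500.3, 497.8])   # one estimate per PV
std_errors = np.array([2.9, 3.1, 3.0, 2.8, 3.2])            # sampling SE per PV
M = len(estimates)

combined = estimates.mean()                          # point estimate: average over PVs
within_var = (std_errors ** 2).mean()                # average sampling variance
between_var = estimates.var(ddof=1)                  # variance between PV estimates
total_var = within_var + (1 + 1 / M) * between_var   # Rubin's total variance
total_se = np.sqrt(total_var)

print(f"Combined estimate: {combined:.1f} (SE = {total_se:.2f})")
```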

Potential and Methods of Linking ILSA to National Education Policy and Research

International large-scale assessments are powerful tools for cross-national comparisons. However, given practical constraints on the design and implementation of ILSAs, particularly limited administration time and its consequences for the selection of cross-nationally agreed-upon constructs, ILSAs might not meet the specific needs of a given country. To counter this problem, methodological tools can be utilized to link ILSAs to specific national data. Such linkages can, in principle, yield country-specific trend data, allow for national adaptations and add-ons of country-specific constructs, and provide evaluations of country-specific programs. This section provides an overview of methodological approaches to linking ILSA data to national data, with specific national policy issues serving as contexts (Table 2).


Table 2 Overview of the second part of the international handbook: Methodology

Designing and Implementing ILSAs
Sampling Design in ILSA (Sabine Meinck)
Designing Measurement for All Students in ILSAs (David Rutkowski & Leslie Rutkowski)
Implementing ILSAs (Juliette Lyons-Thomas, Kadriye Ercikan, Eugenio Gonzalez, & Irwin Kirsch)
Dilemmas in Developing Context Questionnaires for International Large-Scale Assessments (Martin Hooper)
Multistage Test Design Considerations in International Large-Scale Assessments of Educational Achievement (Leslie Rutkowski, David Rutkowski, & Dubravka Svetina Valdivia)

Methods of Analysis
Secondary Analysis of Large-scale Assessment Databases: Challenges, Limitations, and Recommendations (Eugenio Gonzalez)
Methods of Causal Analysis with ILSA Data (Jan-Eric Gustafsson & Trude Nilsen)
Trend Analysis with International Large-Scale Assessments: Past Practice, Current Issues, and Future Directions (David Kaplan & Nina Jude)
Cross-Cultural Comparability of Latent Constructs in ILSAs (Jia He, Janine Buchholz, & Jessica Fischer)
Analyzing International Large-Scale Assessment Data with a Hierarchical Approach (Ronny Scherer)
Process Data Analysis in ILSAs (Denise Reis Costa)

Potential and Methods of Linking ILSA to National Education Policy and Research
Extending the ILSA Study Design to a Longitudinal Design: TIMSS & PIRLS Extension in the Czech Republic – CLoSE Study (David Greger, Jana Straková, & Patrícia Martinková)
Extending International Large-Scale Assessment Instruments to National Needs: Germany's Approach to TIMSS and PIRLS (Knut Schwippert)
Extending the Sample to Evaluate a National Bilingual Program: Madrid (Carmen Tovar Sánchez, Francisco Javier García Crespo, & Ruth María Martín Escanilla)
A Non-Western Perspective on ILSAs: The Case of the Gulf States (Oliver Neuschmidt, Zuwaina Saleh Issa Al-Maskari, & Clara Wilsher Beyer)

Part 3: Findings

Section editor: Ronny Scherer (University of Oslo, Norway)

This part of the handbook reviews selected findings from international large-scale assessments that are relevant to educational research, practice, and policymaking. These findings are compared with and integrated into the current state of specific fields of educational research (for example, as reported by reviews, meta-analyses, international study reports, and experimental, cross-sectional, and longitudinal studies). Instead of synthesizing mere descriptive reports (e.g., reports of achievement scores or levels across countries and domains), each chapter reviews the standing of and relations among specific constructs, concepts, and models describing the inputs, processes, and outputs of education. In addition, the authors discuss the alignment between the findings from ILSA and non-ILSA studies to map the current state of the field. This part of the international handbook contains three sections that are organized along the different levels of educational systems and data analysis, that is, the student, classroom, and school level, and one section that focuses on the cross-cutting theme of equity and diversity (see Table 3).

Table 3 Overview of the third part of the handbook: Findings

Schools, Principals, and Institutions
A Systematic Review of Studies Investigating the Relationships Between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS (Trude Nilsen & Nani Teig)

Classrooms, Teachers, and Curricula
Instructional Quality in TIMSS and PISA (Eckhard Klieme & Trude Nilsen)
Inquiry in Science Education (Nani Teig)
Teacher Competence and Professional Development (Armin Jentsch & Johannes König)
Teachers' Beliefs (Heather Price)
Homework: Facts and Fiction (Rubén Fernández-Alonso & José Muñiz)

Students, Competences, and Dispositions
International Achievement in Mathematics, Science, and Reading (Ina Mullis & Dana Kelly)
Digital Competences: Computer and Information Literacy and Computational Thinking (Wolfram Schulz, Julian Fraillon, John Ainley, & Daniel Duckworth)
Student Motivation and Self-Beliefs (Hanna Eklöf)
Well-being in International Large-Scale Assessments (Francesca Borgonovi)

Equity and Diversity
Gender Differences in School Achievement (Monica Rosén, Isa Steinmann, & Inga Wernersson)
Dispersion of Student Achievement and Classroom Composition (Camilla Rjosk)
Perspectives on Equity by SES, Gender, Age, and Immigrant Status (Emma García)
Family Socioeconomic and Migration Background Mitigating Educational-Relevant Inequalities: Findings from TIMSS, PIRLS, and PISA (Victoria Rolfe & Kajsa Yang Hansen)

Schools, Principals, and Institutions

School-level constructs and characteristics are key to understanding educational processes and conditions. The existing body of knowledge abounds in empirical studies identifying the connection of these constructs and characteristics to relevant educational outcomes. School climate represents one of these key constructs. Broad conceptualizations of school climate include physical, social, affective, and academic aspects of school life. This section presents a systematic review of the school climate research utilizing ILSA data.

Classrooms, Teachers, and Curricula

This section addresses concepts and characteristics of classrooms, teachers, and curricula. While a plethora of relevant constructs indicating these concepts exists, this section focuses on those that are frequently measured across ILSAs, including instructional practices such as inquiry-based teaching, teacher competence, professional development, and beliefs, as well as opportunities to learn and activities outside the classroom. Notably, these measures are located at different levels of analysis in different ILSAs: For instance, while TIMSS includes information on the clustering of students in classrooms and thus allows for inferences about the relations between instructional variables and educational outcomes at the classroom level, PISA only specifies the school level and may not allow for such inferences. Consequently, the authors of the chapters in this section had to select the ILSA data according to the appropriate levels of analysis at which the concepts and characteristics operate.

Students, Competences, and Dispositions

This section brings to attention students' competences and dispositions in several domains and contexts. Besides reviewing the knowledge base of key outcome variables, that is, student achievement in the classical domains of mathematics, reading, and science, the authors also review the information ILSAs provide about cross-curricular skills, such as digital competence and computational thinking. Next to these outcome measures, two chapters focus on outcomes beyond achievement: student motivation, beliefs, and well-being.

Equity and Diversity

Finally, this section provides insights into the cross-cutting theme of equity and diversity. The chapters in this section use a broad range of indicators for conceptualizing, measuring, and framing equity and diversity in educational contexts. These approaches include but are not limited to gender differences in educational achievement, dispersion of achievement, migration, and home, parental, economic, and school background. ILSA data make visible possible gender, age, immigration, or socioeconomic status gaps in educational outcomes within and across countries, ultimately informing educational practice and policymaking. The chapters in this section review this knowledge and offer insights into the evidence that some opportunities to mitigate achievement gaps exist.

2 Background, Aims, and Theories of the Comparative Large-Scale Studies in Education

Trude Nilsen, Jan-Eric Gustafsson, and Agnes Stancel-Piątak

Contents
Introduction . . . 14
Aim . . . 14
Themes and Content . . . 15
The Theoretical Foundation and Perspectives of the Handbook . . . 15
Educational Effectiveness . . . 15
The Context-Input-Process Output Model . . . 16
The Hierarchical School System . . . 17
Delineating the Scope . . . 19
Perspectives and Principles Underlying the Handbook . . . 20
Politically and Culturally Balanced . . . 20
Methodological Perspectives . . . 20
Extended Student Outcomes . . . 20
Positioning the Handbook Within the Existing Literature . . . 21
The Audience of This Handbook . . . 22
Summary . . . 23
References . . . 23

T. Nilsen (*) Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway, e-mail: [email protected]
J.-E. Gustafsson University of Gothenburg, Gothenburg, Sweden, e-mail: [email protected]
A. Stancel-Piątak IEA Hamburg, Hamburg, Germany, e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_2


Abstract

A large body of knowledge and competence has accumulated since the first international large-scale assessment (ILSA) was implemented in the 1960s. Moreover, ILSAs have inspired numerous valuable debates about education, policy, assessment, and measurement over this period. Since the first ILSA, the number of publications using ILSA data for research has increased nearly exponentially. Given the important role of the ILSAs for education, policy, practice, and research, there is a need to synthesize all this knowledge. This handbook synthesizes the knowledge that has emerged from the ILSAs, the debates on ILSAs, the theories underlying the ILSAs, historical and political perspectives, the methodology pertaining to ILSAs, and the findings from studies using ILSA data. The present chapter introduces the handbook and describes its aims, themes, underlying theories, and perspectives. Furthermore, the chapter positions the handbook within the existing literature and describes the targeted audience.

Keywords

International Large-scale Assessment · Educational effectiveness · The IPO model · The hierarchical school system · Political and methodological perspectives

Introduction

International large-scale assessment (ILSA) has become increasingly important for education, policy, practice, and research since the early ILSAs in the 1960s (Husén & Postlethwaite, 1996; Wagemaker, 2020). Indeed, the ILSAs have undergone a tremendous development since they were first implemented, and there have been mutual influences between ILSAs and society, policy, and research during this period (Rutkowski et al., 2014). During this development, much knowledge and competence has been built within the international research community. Due to the important role of ILSAs for policy, practice, and research, it is important to synthesize the knowledge that has developed over more than half a century. It is further important to synthesize the many debates spurred by ILSAs, such as debates on the role of ILSAs for policy and education, or on what education should be and how and why one measures student outcomes (e.g., Addey et al., 2017). These debates, as well as the theories underlying the existing ILSAs, political and historical perspectives, methodologies, and findings from ILSAs, are synthesized in this International Handbook of Comparative Large-Scale Studies in Education.

Aim

The overarching aims of this book are threefold. Firstly, the aim is to provide a general and comprehensive overview of a variety of important topics related to ILSAs (i.e., history and development, theories, methodology, findings from analysis of ILSA data and their consistency with meta-studies). Secondly, to present an in-depth view into selected topics, such as theoretical foundations of ILSAs, selected methods, and results. The handbook will provide systematic insight into the knowledge generated by ILSAs within central areas of educational research (e.g., teacher competence, school climate, equity). Thirdly, to critically discuss the benefits and challenges related to the development, implementation, and analysis of ILSA data, and the impacts of ILSAs on policy, research, and practice, providing recommendations for stakeholders and researchers.

Themes and Content

The main theme of the handbook revolves around ILSAs, their background, policy, theoretical foundations, methodology, and findings. It consists of three main parts pertaining to (1) meta-perspectives and theories, (2) methodology, and (3) findings.

1. In the first part, historical, political, and economic meta-perspectives on ILSAs are discussed. Moreover, systemic theories for the analysis of educational systems are presented, as well as an in-depth discussion of several theories underlying equity and cognitive and affective outcomes. The first part also provides an overview of all ILSAs conducted, including their aims, frameworks, and designs. Accomplishments and limitations of ILSAs, including recommendations for future directions, are discussed.
2. The second part of the book, on methodology, addresses methods for the implementation of ILSAs and the analysis of data from ILSAs, including reviews of the most commonly used methods, and recommendations. Further, the authors present both the potential benefits and limitations of ILSAs for national education policy and research, using selected examples such as longitudinal extensions of ILSAs.
3. Lastly, the third part of the handbook provides a substantial number of reviews and systematic reviews of findings from studies using ILSA data, describing how these findings contribute to prominent fields in educational research (e.g., equity, school climate, and teacher quality). It further discusses how these findings align with the current state of these fields and points out knowledge gaps for future research.

The Theoretical Foundation and Perspectives of the Handbook

Educational Effectiveness

While this handbook includes chapters pertaining to many areas and fields, such as sociology, the economics of education, educational psychology, teacher education, applied methodology, psychometrics, and educational inequality, the underlying theory behind ILSAs and this handbook is educational effectiveness.


Educational effectiveness is a field that concerns identifying and investigating factors that may explain variation in student outcomes (Creemers & Kyriakides, 2008; Reynolds et al., 2014). The field has developed side by side with ILSAs, and the two have mutually influenced one another (for more on this, see ▶ Chap. 12, "Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness," by Kyriakides, in this handbook). It is a broad field encompassing many areas such as teachers' instruction, school climate, student motivation, and curriculum, to mention a few. Teacher effectiveness and school effectiveness have been subsumed under the field of educational effectiveness (Reynolds et al., 2014). However, educational effectiveness has its roots in a narrower field, school effectiveness, which does not consider the dynamic nature of schools over time and across different populations, outcomes, and subject domains (Creemers & Kyriakides, 2008). Educational effectiveness, on the other hand, acknowledges the complex and dynamic school system, including growth over time as well as stability and differential effectiveness (Kyriakides, this handbook; Scheerens, 2016).

The Context-Input-Process Output Model

Based on behavioral and cognitive theories, Carroll (1963) built one of the earliest models of learning, namely, the model of school learning, which was developed within the input-process-output framework. This model was used by the early ILSAs (Cresswell et al., 2015; Wagemaker, 2020). Since then, comprehensive models of learning have emerged, developed, and changed. Most contemporary comprehensive models of schooling and learning are based on the input-process-output (IPO) approach (for more on this, see ▶ Chap. 8, "Comprehensive Frameworks of School Learning in ILSAs," by Stancel-Piątak and Schwippert, in this handbook). The idea behind the IPO approach is that learning outcomes are a function of input by means of process. The Context-Input-Process-Output (CIPO) model by Scheerens (1990) was built on the input-process-output approach. However, the CIPO model contends that input, process, and output are all affected by the context (see Fig. 1, a simplification of the model by Scheerens (1990), p. 74). This model was used within the field of school effectiveness and by various ILSAs (Kuger et al., 2016; Scheerens, 1990). The IPO approach also underlies the distinction between the intended, implemented, and achieved curriculum (students' learning outcomes), which is one of the theoretical foundations of the IEA studies (Mullis & Martin, 2007). Figure 2 shows an illustration of the three curricular levels, where the intended curriculum reflects the national or system level and would typically contain the national curriculum. The implemented curriculum reflects what the teachers actually teach, while the achieved curriculum refers to student outcomes such as achievement or student motivation.

Fig. 1 CIPO model by Scheerens (1990): context (e.g., school size) conditions input (e.g., teacher qualifications), process at the school level (e.g., leadership) and class level (e.g., teaching quality), and output (e.g., student achievement)

Fig. 2 The curriculum levels (see, e.g., Mullis & Martin, 2007): the intended curriculum (e.g., the national curriculum), the implemented curriculum (what the teachers teach), and the achieved curriculum (student achievement)

Indeed, the CIPO model and the curriculum levels lie at the core of the handbook: the first part of the handbook, "Meta-perspectives on ILSAs," including theories behind the ILSAs and historical and policy perspectives, may be viewed as inputs to ILSAs. The second part of the handbook, "Methodology," including implementations, designs, and methods of ILSAs, may be viewed as the process of ILSAs. The last part of the handbook, "Findings," including systematic reviews and reviews of findings based on ILSA data, may be viewed as the output.

The Hierarchical School System

In addition to using the CIPO model as an overarching framework for this handbook, the hierarchical school system is considered to be important for ILSAs. A hierarchical school system refers to the fact that students are nested within classrooms, classrooms are nested within schools, and schools are nested within regions, which again are nested within countries.


and design, and it is also important to take it into consideration when analyzing data from ILSAs. Accounting for the hierarchical school system is hence especially important for the last two parts of the book (methodology and findings). The IPO and CIPO models later evolved into several models of learning (see ▶ Chap. 8, "Comprehensive Frameworks of School Learning in ILSAs," by Stancel-Piątak and Schwippert, in this handbook). Within educational effectiveness, the model most extensively used is the dynamic model of educational effectiveness by Creemers and Kyriakides (2008) (for more on this, see the chapter by Kyriakides in this handbook). This model takes into account the dynamic nature of the school system and its hierarchical design, with direct and indirect effects within and between levels. Fig. 3 serves as an illustration of the theoretical foundation of the handbook and resembles a simplified illustration of the dynamic model of educational effectiveness (Creemers & Kyriakides, 2008). There are some minor differences between the dynamic model and Fig. 3. In Fig. 3, the levels of education are included (e.g., school level), while the student level refers to individual characteristics only (e.g., SES and gender) because other student-level factors such as motivation and well-being are included among the outcomes. The figure illustrates the different layers of the school system and how they interact. The illustration is simplified because, after all, within a school system almost everything interacts, and a completely accurate model would be extremely complex. The figure illustrates that the strongest relationships are those between proximal factors at the student and teacher level and student outcomes (thick arrows), while more distal factors at the school and national level are expected to have weaker effects on student outcomes (thin arrows). However, while a fair share of previous research has shown that distal factors may have weaker effects than proximal factors (e.g., Hattie, 2009), this varies between countries, domains, age groups, and over time (e.g., Blömeke & Olsen, 2019).

Fig. 3 The theoretical foundation of the handbook, based on the dynamic model: the national and international level (e.g., educational policy), the school level (e.g., school climate, accountability, leadership), the teacher level (e.g., teacher competence, beliefs, and instruction), the student level (e.g., characteristics such as SES and gender), and outcomes (e.g., cognitive outcomes such as achievement in mathematics and non-cognitive outcomes such as student motivation)

The international and national levels of the dynamic model provide educational policy and create new national curricular reforms that would first and foremost affect the schools, but could also affect student outcomes (Creemers & Kyriakides, 2008).

At the school level, factors such as leadership and school climate could affect the teachers (e.g., Blömeke et al., 2021; Scherer & Nilsen, 2016) and also student outcomes (Wang & Degol, 2016). There could also be indirect effects; for instance, a sound and safe school climate may have a positive influence on teachers, which in turn may positively influence student outcomes (Scherer & Nilsen, 2016). At the class level, teachers have been shown to have a strong effect on students, in terms of their competence (e.g., pedagogical content knowledge), their characteristics (e.g., beliefs and experience), and their instruction (Baumert et al., 2010; Blömeke et al., 2015; Kuger et al., 2016; Nilsen & Gustafsson, 2016). There may also be indirect relationships; for example, the effect of teacher competence on student achievement is often mediated via their instruction (e.g., Baumert et al., 2010). Last but not least, student characteristics such as gender and home background affect student outcomes, and some of these characteristics (e.g., home background) may be influenced by educational policy at the national level (e.g., Hansen & Munck, 2012; Sirin, 2005). Moreover, student characteristics may affect teachers' instruction. There are more direct and indirect effects both within and across levels than described here; the point, however, is to underline that researchers implementing ILSAs and analyzing the data need to take these effects and the hierarchical system into account.
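As a minimal sketch of what accounting for this nesting can look like in an analysis, the example below fits a two-level random-intercept model with students nested in schools. The data file and the variable names (math_score, ses, teaching_quality, school_id) are hypothetical stand-ins for the kinds of factors discussed above, and the sketch deliberately ignores features that operational ILSA analyses require, such as plausible values, sampling weights, and further levels (classrooms, countries).

```python
# Hypothetical sketch: a two-level random-intercept model for ILSA-type data,
# with students (level 1) nested in schools (level 2).
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file; column names are illustrative only.
data = pd.read_csv("ilsa_students.csv")  # math_score, ses, teaching_quality, school_id

# Student achievement regressed on a student-level factor (SES) and a
# class/school-level factor (teaching quality), with a random intercept for
# schools to account for the nesting of students within schools.
model = smf.mixedlm(
    "math_score ~ ses + teaching_quality",
    data=data,
    groups=data["school_id"],
)
result = model.fit()
print(result.summary())

# The school-level intercept variance relative to the residual variance shows
# how much of the achievement variation lies between rather than within schools.
```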

Delineating the Scope

In the present handbook, the focus is on ILSAs, that is, international large-scale assessments. The word "international" in ILSA reflects studies in which all countries may potentially participate, such as TIMSS; any country is free to participate in TIMSS. However, many would also include regional studies such as SACMEQ (Southern and Eastern Africa Consortium for Monitoring Educational Quality) under the umbrella of ILSA. The word "assessment" in ILSA refers to an assessment of cognitive skills of some kind. In this regard, it is important to differentiate between a survey, such as TALIS, and an assessment, such as PISA or ICILS, which includes an assessment of students' cognitive skills. This does not mean that regional studies or surveys such as TALIS are excluded; it rather means that the handbook focuses more on international studies that include an assessment. Moreover, the focus of the handbook is on cyclical studies that concern schooling, including primary, lower secondary, and upper secondary school. Studies focusing on early childhood education (pre-primary, pre-school, or kindergarten), such as TALIS Starting Strong, or on the adult population (such as PIAAC) are not a primary focus. The only international large-scale studies that are open to participation from all countries, include an assessment of student outcomes, include students in school, and are repeated regularly are TIMSS, PISA, PIRLS, and ICILS (see https://ilsagateway.org/). These studies will hence be exemplified and referred to more
extensively than other studies. In addition, there is one chapter focusing on regional studies. Regional studies and national extensions are described in several chapters of the theoretical and methodological section.

Perspectives and Principles Underlying the Handbook

Just as there are certain theoretical models that underlie this handbook, such as the dynamic model of educational effectiveness, there are also certain principles and perspectives on which this book is founded.

Politically and Culturally Balanced

Our goal is to provide a handbook that is balanced. There are a number of strong voices both for and against the ILSAs, and this handbook therefore presents chapters with both pro-ILSA voices and critiques. The goal is further to provide a culturally balanced handbook, and we have hence strived to include authors from both Western and non-Western countries. The reader might acknowledge that a possible impression of an imbalance in this regard must be partially attributed to the fact that ILSAs themselves are still dominated by researchers and contributions from developed countries and so-called "Western societies."

Methodological Perspectives

While quantitative methodology underlies much of the design, sampling procedures, assessment, and analyses of ILSAs, the handbook includes a contribution focusing on qualitative perspectives (see ▶ Chap. 18, "The Use of Video Capturing in International Large-Scale Assessment Studies: Methodological and Theoretical Considerations" by Klette). Moreover, many authors have included some reflections on these perspectives in their chapters (e.g., ▶ Chap. 13, "Overview of ILSAs and Aspects of Data Reuse" by Mertes).

Extended Student Outcomes

Even though many areas of research within educational effectiveness are dominated by research within the domain of mathematics (e.g., the areas of teacher competence and instruction, and student motivation), this handbook provides insight into research focusing on other subject domains, such as reading and science. While these three subject domains dominate the ILSA scientific map, the handbook further emphasizes the importance of including other domains such as aesthetic
subjects (see, e.g., ▶ Chap. 20, “ILSA in Arts Education: The Effect of Drama on Competences” by Gürgens Gjærum et al.) as well as cross-cutting generic competences such as digital competences (see, e.g., ▶ Chap. 45, “Digital Competences: Computer and Information Literacy and Computational Thinking” by Schulz, Fraillon, Ainley, and Duckworth), or inquiry (see, e.g., ▶ Chap. 40, “Inquiry in Science Education” by Teig).

Positioning the Handbook Within the Existing Literature

The impact of ILSAs on policy, research, and practice has increased substantially over the past decades. Growing demands, expectations, and attention regarding results from ILSAs, along with technological advancements in research, have led to the development of ILSAs toward higher quality but also higher complexity. These studies produce vast amounts of data and are continually growing in number. Moreover, the data allow for increasingly robust inferences and valuable findings for policymakers and researchers with respect to new and old areas of educational research. Hence, there is a growing need for information about ILSAs among researchers, stakeholders, policymakers, and practitioners.

This information, however, tends to be scattered, as publications usually only address selected aspects of ILSAs and target specific audiences. For instance, the "Handbook of International Large-Scale Assessment: Background, Technical Issues, and Methods of Data Analysis" by Rutkowski, von Davier, and Rutkowski (2014) is an excellent publication focused on methodology and directed especially at researchers who are well familiar with ILSAs and have advanced expertise in psychometrics. There are also other handbooks in the field, but none is open access, and all are rather focused on a specific area (e.g., methods) and directed to a particular audience (e.g., researchers with psychometric and statistical skills). Hence, there is a need for a publication that gathers all substantial information related to ILSAs, providing a comprehensive overview while also providing insights into a variety of aspects of ILSAs in a scientific, constructive, and critical manner.

This handbook addresses this need by including historical, political, and economic meta-perspectives; overviews of ILSAs focusing on student learning as defined above; theories; methods; and findings from studies using ILSAs. Included in the handbook, for example, are major debates about ILSAs from a neoliberal and social network perspective, the role of ILSAs in developing countries, and ILSAs and educational accountability, as well as thorough reviews, systematic reviews, and critical reflections on theories. Furthermore, conceptual and methodological accomplishments of ILSAs and remaining criticism and limitations are discussed. The handbook therefore provides comprehensive and valuable information for different groups such as researchers, students, teachers, and stakeholders.

The Audience of This Handbook

There is a need for a publication that is accessible, understandable, international, and useful – not only to advanced and beginning educational researchers but also to policymakers, stakeholders, and practitioners. This book addresses this need, embracing a broader audience than other handbooks on ILSAs. Moreover, the book is international and includes renowned authors and editors from all continents addressing ILSAs of interest to both Western and non-Western countries. Care is taken to provide balanced insights into a variety of different ILSAs launched and conducted by diverse stakeholders and organizations.

The book is directed predominantly at researchers. The language and content are aimed at advanced researchers from the educational field as well as young researchers and students. By providing a broad coverage of reviews from different fields, the book should be valuable not only to educational researchers but also to researchers from other disciplines interested in education (e.g., psychometricians, sociologists, statisticians, or economists). In particular, researchers and students less familiar with ILSAs may use the book to learn about ILSAs in the context of policy, their theoretical underpinnings, or research pertaining to ILSAs. The comprehensive and in-depth discussions presented in the theoretical part, as well as the extensive and partially systematic reviews included in the findings section, might be of great value for advanced scientists as well as for young researchers. Most of the methodology section is written in a manner that is understandable and accessible for students and researchers not familiar with the data. Nevertheless, the methodology part is also valuable for experienced researchers who are familiar with ILSA data, as it provides in-depth presentations of the design and sampling procedures of several ILSAs, as well as advice on comprehensive methods of analysis. It contains detailed overviews and insights into a number of ILSAs, providing knowledge of how the data are used by the research community and future recommendations on the instruments.

However, researchers are not the only ones who might benefit from this book. Politicians and stakeholders interested in learning about ILSAs may value learning about the history and developments of ILSAs, as well as about their potential and limitations with regard to, for example, policy implications. Furthermore, the part containing reviews of findings from studies analyzing ILSA data should be of interest to stakeholders, teachers, and policymakers. In more general terms, advanced or senior researchers may value the quality of the in-depth views, the critical perspectives, the fact that single topics are always embedded in the overall discussion, as well as the diverse perspectives discussed. Junior researchers may value the accessible language, the comprehensiveness, and the research examples provided. Policymakers and practitioners may value the visual examples provided in several chapters, the comprehensive summaries, the structure and clarity of the language, and the relevance of the findings in terms of coverage of topics in which policymakers are interested. Teachers and other practitioners who would like to know more about ILSAs and outcomes of studies using ILSA data
(e.g., regarding teachers' instructional quality) may also find the handbook valuable. In addition, the book should be useful for universities that offer courses on comparative education and need a comprehensive textbook on ILSAs.

Summary

This chapter has described the theoretical underpinnings of the handbook and provided the aims, themes, perspectives, and principles underlying it, the rationale and necessity of the handbook, and its intended audience. A large number of the authors of the chapters in the handbook are well-renowned researchers within their fields. The contributions have undergone blind peer review, and we would like to thank all authors for their extraordinary contributions, their patience, and their willingness to adopt changes to align their publications with the goals of this handbook. All the authors and the chapters, along with an overview of the content and structure, are provided in the chapter "Abstract and overview." The last chapter of the handbook, ▶ Chap. 52, "60-Years of ILSA: Where It Stands and How It Evolves," is an extensive discussion of the chapters, of findings, perspectives, and outlooks.

References

Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017). The rise of international large-scale assessments and rationales for participation. Compare: A Journal of Comparative and International Education, 47(3), 434–452.
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y.-M. (2010). Teachers' mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47(1), 133–180.
Blömeke, S., & Olsen, R. V. (2019). Consistency of results regarding teacher effects across subjects, school levels, outcomes and countries. Teaching and Teacher Education, 77, 170–182. https://doi.org/10.1016/j.tate.2018.09.018
Blömeke, S., Gustafsson, J.-E., & Shavelson, R. J. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223(1), 3–13.
Blömeke, S., Nilsen, T., & Scherer, R. (2021). School innovativeness is associated with enhanced teacher collaboration, innovative classroom practices, and job satisfaction. Journal of Educational Psychology, 113(8), 1645–1667. https://doi.org/10.1037/edu0000668
Carroll, J. (1963). A model of school learning. Teachers College Record, 64(8), 723–733.
Creemers, B., & Kyriakides, L. (2008). The dynamics of educational effectiveness: A contribution to policy, practice and theory in contemporary schools. Routledge.
Cresswell, J., Schwantner, U., & Waters, C. (2015). A review of international large-scale assessments in education: Assessing component skills and collecting contextual data. PISA. The World Bank/OECD Publishing.
Hansen, K. Y., & Munck, I. (2012). Exploring the measurement profiles of socioeconomic background indicators and their differences in reading achievement: A two-level latent class analysis. IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 5, 49–77.
Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.
Husén, T., & Postlethwaite, T. N. (1996). A brief history of the International Association for the Evaluation of Educational Achievement (IEA). Assessment in Education: Principles, Policy & Practice, 3(2), 129–141.
Kuger, S., Klieme, E., Jude, N., & Kaplan, D. (2016). Assessing contexts of learning: An international perspective. Springer International Publishing.
Mullis, I. V., & Martin, M. O. (2007). TIMSS in perspective: Lessons learned from IEA's four decades of international mathematics assessments. In Lessons learned: What international assessments tell us about math achievement (pp. 9–36).
Nilsen, T., & Gustafsson, J.-E. (Eds.). (2016). Teacher quality, instructional quality and student outcome: Relationships across countries, cohorts and time (Vol. 2). Springer International Publishing.
Reynolds, D., Sammons, P., De Fraine, B., Van Damme, J., Townsend, T., Teddlie, C., & Stringfield, S. (2014). Educational effectiveness research (EER): A state-of-the-art review. School Effectiveness and School Improvement, 25(2), 197–230.
Rutkowski, L., von Davier, M., & Rutkowski, D. (2014). Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis. Chapman and Hall/CRC.
Scheerens, J. (1990). School effectiveness research and the development of process indicators of school functioning. School Effectiveness and School Improvement, 1(1), 61–80.
Scheerens, J. (2016). Opportunity to learn, curriculum alignment and test preparation: A research review. Springer International Publishing.
Scherer, R., & Nilsen, T. (2016). The relations among school climate, instructional quality, and achievement motivation in mathematics. In T. Nilsen & J.-E. Gustafsson (Eds.), Teacher quality, instructional quality and student outcomes: Relationships across countries, cohorts and time (Vol. 2, pp. 51–79). IEA/Springer.
Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417–453.
Wagemaker, H. (2020). Reliability and validity of international large-scale assessment: Understanding IEA's comparative studies of student achievement. Springer Nature.
Wang, M.-T., & Degol, J. L. (2016). School climate: A review of the construct, measurement, and impact on student outcomes. Educational Psychology Review, 28(2), 315–352. https://doi.org/10.1007/s10648-015-9319-1

Part II Meta-perspectives on ILSAs: Theoretical Meta-perspectives on ILSAs

3 The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth

Eric A. Hanushek and Ludger Woessmann

Contents

Overview
Understanding Growth
A Conceptual Framework for Knowledge and Growth
Growth Models with School Attainment
An Extended View of the Measurement of Human Capital
Knowledge Capital and Growth
Causality
The Gains from Universal Basic Skills
The Global Challenge
Economic Impacts of Universal Basic Skills
Conclusions
Cross-References
References
E. A. Hanushek
Stanford University, Stanford, CA, USA; CESifo, Munich, Germany; National Bureau of Economic Research, Cambridge, MA, USA; IZA, Bonn, Germany
e-mail: [email protected]

L. Woessmann (*)
University of Munich and ifo Institute, Munich, Germany; CESifo, Munich, Germany; IZA, Bonn, Germany
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_4

Abstract

Economic theory suggests that the skills of a society's population are important determinants of economic growth. ILSAs have been used to put these theories to an empirical test. This chapter provides an overview of models of the role of educational achievement in macroeconomic outcomes and summarizes empirical economic work using ILSAs to measure relevant skills. In economic terms, the aggregate cognitive skills of the population as measured by ILSAs can be interpreted as the knowledge capital of nations. The chapter concludes that there is strong evidence that the cognitive skills of the population – rather than mere school attainment – are powerfully related to long-run economic growth. The relationship between knowledge capital and growth proves extremely robust in empirical applications. Growth simulations reveal that the long-run rewards to educational quality are large but also require patience.

Keywords

Economic growth · Knowledge capital · Human capital · ILSA · Cognitive skills

Overview

With the passage of the Sustainable Development Goals (SDGs) by the United Nations in 2015, the topic of economic growth became central to worldwide discussion (https://sustainabledevelopment.un.org/sdgs). The 17 separate goals cannot be achieved without strong economic growth that expands the resources available and that permits addressing this range of worthwhile but expensive goals. But achieving strong economic growth is not possible without developing strong human capital in each of the countries. This chapter, which draws on Hanushek and Woessmann (2015a, b), describes what is known about economic growth and how it is affected by strong education programs.

Even in the richest countries, a segment of the population has been left behind to deal with limited resources and limited opportunities. This segment has faced health insecurity, constrained job possibilities, and a myriad of other threats associated with poverty. The difficulties for this group are compounded when countries as a whole lag behind world improvement in economic outcomes.

Sustainable development calls for recognizing the full costs of development. In the past, growth and development have come with costs to the environment. These costs accumulate over time, leading to excessive pressures on the ecosystem that threaten the future. Sustainable development will depend on innovation that permits growth while respecting natural resources.

The key to achieving inclusive and sustainable development lies in increasing the knowledge and skills of populations. Knowledge-led growth, the hallmark of at least the past half century, provides a path that converges on the overall goals of the broader world community. Inclusive development is best pursued through expanded economic opportunities. Simply put, it is much easier to ensure inclusion and to
alleviate the burdens of poverty when the whole economic pie is larger. Expanded skills allow a broader segment of society to actively contribute to the economy; this increased participation contributes to enhanced productivity and reduces the redistributive needs. Within a fixed economy, even attempting to redistribute resources is generally politically difficult and may threaten the overall performance of the economy. Expanded skills also facilitate sustainable development and growth because they lead to innovative capacity that allows economic advancement without simultaneously depleting environmental resources.

Without the necessary cognitive skills to compete and thrive in the modern world economy, many people are unable to contribute to and participate in development gains. Literacy was once defined in terms of the ability to read simple words. But in today's global marketplace, it is more. It is the capacity to understand, use, and reflect critically on written information; the capacity to reason mathematically and use mathematical concepts, procedures, and tools to explain and predict situations; as well as the capacity to think scientifically and to draw evidence-based conclusions. Today a substantial fraction of the world's population is functionally illiterate. The functional illiterates do not have the skills that employers seek and that the labor market rewards.

While minimal skills are important for individual participation in modern economies, the discussion here focuses mostly on the aggregate implications of the cognitive skills of a nation's workforce. Where significant proportions of the population have limited skills, economies are generally bound to employ production technologies that lag those in advanced economies. They also have more limited ability to innovate or even to imitate the possibilities that are found near the economic production frontier.

Empirically, as described below, international large-scale assessments (ILSAs) provide useful measures of the cognitive skills that are relevant for growth. Aggregate cognitive skills form the knowledge capital of a nation, and aggregate scores on international tests prove to be good measures of knowledge capital. The economic evidence indicates that countries with less skilled populations – with less knowledge capital – will find productivity improvements difficult. As a result, they will find economic growth and development to be slower. In addition, what growth there is will be less inclusive because those without minimal skills will be unable to keep pace with their more-skilled peers.

Cognitive skills are of fundamental importance for developing countries. But these skills also matter for advanced countries. Thus development goals built around minimal skills have meaning to all societies around the world. They correct the distorted picture of the challenges facing the world suggested by the original Millennium Development Goals and the Education for All initiative, which framed the issue of education and skills as relevant to developing countries only (https://www.un.org/millenniumgoals/). The challenges have clearly been more severe for less developed economies, but they were and are real for more developed economies as well.

Existing research shows that there has historically been a strong and direct relationship between the cognitive skills of national populations, measured by international tests of mathematics and science achievement, and countries' long-run growth.
In fact, ILSAs have provided the data needed to understand the growth
process. Moreover, as discussed further below, the evidence in this modern growth analysis provides strong reason to believe that the relationship is causal – i.e., if a nation improves the skills of its population, it can expect to grow faster. ILSAs of course do not provide a complete view of every country’s population. Some countries have not participated in international tests, so they cannot be directly compared with others, although participation in regional tests in Latin America and Africa provides information for a larger set of countries. Further, even in countries that do participate, the proportion of students who have already left school – and who are therefore out of the view of international testing – varies. The heart of the analysis summarized here offers a concise economic perspective on a primary development goal – bringing all youth up to minimally competitive skills. This fundamental goal emphasizes the importance of skills over mere school attendance. But of course, youth are unlikely to develop appropriate skills without attending school, and the analysis below builds upon the prior development goals related to access. The analysis extends the simple cognitive skills goal to include schooling for all along with minimal skills for all. The past record on the interplay of cognitive skills and economic growth provides a means of estimating the economic gains from meeting the development goal set out in the SDGs. The economic benefit from meeting SDG #4 (“Ensure inclusive and equitable quality education”) can be calculated as the difference in future GDP with universal minimal skills versus GDP with the country’s current knowledge capital. Indeed, it is possible to provide these estimates on a country-by-country basis, at least for the 76 countries with current information on their knowledge capital (from ILSAs) and on the state of their aggregate economy.

Understanding Growth

To develop an understanding of the structure of growth, economists have followed two tracks – tracks that are largely separate but that sometimes intersect. On one track is the development of theoretical models that identify specific features and mechanisms of economies and trace their implications for growth over time. On the other track are empirical exercises designed to extract regularities in growth based on observed differences in outcomes. At times, specific theoretical models drive a particular empirical analysis. At other times, the empirical work is more loosely connected to any specific model and is driven more by data and statistical forces. Invariably, both strands of modern work on growth recognize the importance of human capital. This partly incorporates the insight from the work of Theodore Schultz (1961), Gary Becker (1964), Jacob Mincer (1974), and others since the late 1950s that human capital is important for individual productivity and earnings. But even more, innovation and productivity improvements, while possibly differing in the underlying details, are seen as being fundamentally guided by the invention of people, and invention flows from the knowledge and skills of the population. The attention here is focused on the measurement of human capital and on how improved measurement alters understanding of some fundamental economic issues.

The historic empirical consideration of human capital focused on school attainment measures of human capital. As demonstrated below, however, it is necessary to transition to broader measures that revolve around cognitive skills. This section provides an overview of the conceptual underpinnings of the analysis of economic growth. It then turns to the historic empirical analysis before transitioning to the use of ILSAs to measure the knowledge capital of nations.

A Conceptual Framework for Knowledge and Growth

Economists have devoted enormous time and effort to developing and understanding alternative mechanisms that might underlie the growth of nations. Indeed, entire books are written on models of economic growth and their implications; see, for example, Acemoglu (2009), Aghion and Howitt (1998, 2009), Barro and Sala-i-Martin (2004), and Jones and Vollrath (2013) for introductions. The aim here is simply to provide the outlines of competing approaches, because they have implications not only for how to proceed in empirical work but also for how to interpret any subsequent analyses.

Theoretical models of economic growth have emphasized different mechanisms through which education may affect economic growth. As a general summary, three theoretical models have been applied to the modeling of economic growth, and each has received some support from the data. At the same time, it has been difficult to compare the alternative models empirically and to choose among them based on the economic growth data.

The most straightforward modeling follows a standard characterization of an aggregate production function where the output of the macro economy is a direct function of the capital and labor in the economy. The basic growth model of Solow (1956) began with such a description and then added an element of technological change to trace the movement of the economy over time. The source or determinants of this technological change, although central to understanding economic growth, were not an integral part of that analysis. The so-called augmented neoclassical growth theories, developed by Mankiw, Romer, and Weil (1992), extend this analysis to incorporate human capital, stressing the role of education as a factor of production. Education can be accumulated, increasing the human capital of the labor force and thus the steady-state level of aggregate income. The human capital component of growth comes through the accumulation of more education, which implies that the economy moves from one steady-state level to another. Once at the new level, education exerts no further influence on growth in such a model. The common approach to estimating this model focuses on the level of income and relates changes in GDP per worker to changes in education (and in capital). This view implies a fairly limited role of human capital, because there are natural constraints on the amount of schooling in which a society will invest. It also fails to explain patterns of education expansion and growth for many developing countries (Pritchett, 2006).

A very different view comes from the so-called endogenous growth literature that has developed over the past quarter century, partly building on the early insight of Schumpeter (1912[2006]) that growth is ultimately driven by innovation. In this work, a variety of researchers – importantly, Lucas (1988), Romer (1990), and Aghion and Howitt (1998) – stress the role of human capital in increasing the innovative capacity of the economy through developing new ideas and new technologies. These are called endogenous growth models because technological change is determined by economic forces within the model. Under these models, a given level of education can lead to a continuing stream of new ideas, thus making it possible for education to affect long-run growth rates even when no new education is added to the economy. The common way to estimate these models focuses on growth in income and relates changes in GDP per worker (or per capita) to the level of education. A final view of human capital in production and growth centers on the diffusion of technologies. If new technologies increase firm productivity, countries can grow by adopting these new technologies more broadly. Theories of technological diffusion such as those developed by Nelson and Phelps (1966), Welch (1970), and Benhabib and Spiegel (2005) stress that education may facilitate the transmission of knowledge needed to implement new technologies. In tests involving cross-country comparisons, Benhabib and Spiegel (1994) found a role for educational attainment in both the generation of ideas and in the diffusion of technology. All approaches have in common that they see human capital as being a crucial ingredient to growth.

Growth Models with School Attainment

This section provides an overview of economic growth modeling. (Further details can be found in Hanushek and Woessmann (2008), on which this section is based.) The following equation provides a very simple but convenient growth model: a country's rate of economic growth is a function of the skills of workers (human capital) and other factors (initial levels of income and technology, economic institutions, and other systematic factors) and some unmeasured factors, ε:

growth = γ · Human Capital + β · Other + ε     (1)

Worker skills are best thought of simply as the workers' human capital stock. For expositional purposes, this simple model assumes that there is only one dimension to human capital and that growth rates are linear in these inputs, although these assumptions are not really important for the analysis below. Human capital is nonetheless not directly observed. To be useful and verifiable, it is necessary to specify the measurement of human capital. The vast majority of existing theoretical and empirical work on growth begins – frequently without discussion – by taking the quantity of schooling of workers as a direct measure of
human capital. This choice was largely a pragmatic one related to data availability, but it also had support from the empirical labor economics literature. Mincer (1974), in looking at the determinants of wages, demonstrated that years of schooling was an informative empirical measure of differences in individual skills. In what might be called the standard approach, empirical growth modeling has quite consistently relied on school attainment averaged across the labor force as the measure of aggregate human capital. Early work employed readily available crosscountry data on school enrollment rates, which essentially were interpreted as capturing changes in school attainment. An important innovation by Barro and Lee (1993, 2001, 2013) was the development of internationally comparable data on average years of schooling for a large sample of countries and years, based on a combination of census or survey data on educational attainment wherever possible and using literacy and enrollment data to fill gaps in the census data. Following the seminal contributions by Barro (1991, 1997) and Mankiw et al. (1992), a vast early literature of cross-country growth regressions has tended to find a significant positive association between quantitative measures of schooling and economic growth. To give an idea of the reliability of this association, primary schooling turns out to be the most robust influence factor (after an East Asian dummy) on growth in GDP per capita in 1960–1996 in the extensive robustness analysis by Sala-i-Martin, Doppelhofer, and Miller (2004) of 67 explanatory variables in growth regressions on a sample of 88 countries.

Fig. 1 Years of schooling and economic growth rates without considering knowledge capital. Notes: Added-variable plot of a regression of the average annual rate of growth (in percent) of real GDP per capita in 1960–2000 on average years of schooling in 1960 and initial level of real GDP per capita in 1960 (mean of unconditional variables added to each axis). (Source: Hanushek and Woessmann (2015a))
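The "added-variable plot" mentioned in the figure notes can be read as a partial regression plot: both axes are residualized on the other regressor (here, initial GDP per capita) before plotting, and the variable means are added back to keep the original scale. A minimal sketch of that construction, using hypothetical country-level variable names, might look as follows.

```python
# Hypothetical sketch of an added-variable (partial regression) plot:
# growth against years of schooling after partialling out initial GDP per capita.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.read_csv("country_growth.csv")  # illustrative columns: growth, yrs_school_1960, gdp_pc_1960

# Residualize the outcome and the regressor of interest on the control variable.
growth_resid = smf.ols("growth ~ gdp_pc_1960", data=df).fit().resid
school_resid = smf.ols("yrs_school_1960 ~ gdp_pc_1960", data=df).fit().resid

# By the Frisch-Waugh-Lovell theorem, the slope of this scatter equals the
# multiple-regression coefficient on schooling; adding the means re-centers the axes.
plt.scatter(school_resid + df["yrs_school_1960"].mean(),
            growth_resid + df["growth"].mean())
plt.xlabel("Years of schooling in 1960 (conditional)")
plt.ylabel("Growth of GDP per capita, 1960-2000 (conditional)")
plt.show()
```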

Figure 1 plots the average annual rate of growth in GDP per capita over the 40-year period from 1960 to 2000 against years of schooling at the beginning of the period for a sample of 92 countries. This figure is based on an underlying regression that also includes the initial level of GDP per capita for each country. Inclusion of this reflects the fact that for countries starting behind, it is easier to grow fast because it is just necessary to copy what is done elsewhere. Countries starting ahead must invent new things and new production methods in order to grow fast, and that is generally more difficult. The results depicted by the figure imply that each year of schooling is statistically significantly associated with a long-run growth rate that is 0.6 percentage points higher. There are skeptical studies that raise noteworthy caveats with this depiction. First, Bils and Klenow (2000) raise the issue of causality, suggesting that reverse causation running from higher economic growth to additional education may be at least as important as the causal effect of education on growth in the cross-country association. Second, one of the conclusions that Pritchett (2001, 2006) draws from the fragility of the evidence linking changes in education to economic growth is that it is important for economic growth to get other things right as well, in particular the institutional framework of the economy. Both issues are actually subsumed by better measurement of human capital as discussed next.

An Extended View of the Measurement of Human Capital

Growth models that measure human capital by average years of schooling implicitly assume that a year of schooling delivers the same increase in knowledge and skills regardless of the education system. For example, a year of schooling in Venezuela is assumed to create the same increase in productive human capital as a year of schooling in Singapore. Additionally, this measure assumes that formal schooling is the primary (sole) source of skills and that variations in non-school factors have a negligible effect on education outcomes. This neglect of cross-country differences in the quality of education and in the strength of family, health, and other influences is a major drawback of using the quantitative measure of school attainment to proxy for skills of the labor force in cross-country analyses.

The larger issues can be better understood by considering the source of the skills (human capital). As discussed in the extensive educational production function literature (Hanushek, 2002), these skills are presumed to be affected by a range of factors including family inputs, the quantity and quality of inputs provided by schools, other relevant factors (including labor market experience, health, and so forth), and unmeasured input (ν), as in:

Human Capital = λ · Family + ϕ · School quality + η · Other + ν     (2)

The schooling term is meant to combine both school attainment and its quality. Indeed, there is a broad research base that documents each of these input components.
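To see why this matters for the growth regression, it helps to substitute Eq. 2 into Eq. 1; the following display is only a sketch that combines the terms already defined above (note that the "Other" in Eq. 2 collects individual-level factors such as experience and health, while the "Other" in Eq. 1 collects country-level factors such as institutions):

growth = γ(λ · Family + ϕ · School quality + η · Other + ν) + β · Other + ε

If human capital is then proxied by school attainment alone, the family, school-quality, and remaining skill components are pushed into the error term, so the estimated schooling coefficient absorbs their influence whenever they are correlated with attainment. This is the source of the bias discussed next.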

Obviously, if Eq. 2 describes the formation of skills, simply relying on school attainment in the growth modeling is unlikely to provide reasonable estimates of the role of human capital. They will undoubtedly be biased estimates, and they will be sensitive to the exact model specification and to the inclusion of other country measures – exactly what past analysis shows. The complications from the multiple inputs into skills suggest that the alternative is measuring human capital directly. A compelling alternative is to focus directly on the cognitive skills component of human capital and to measure human capital with test score measures of mathematics, science, and reading achievement. The use of measures of educational achievement, which builds on prior research into both educational production functions and models of economic returns to individuals, has a number of potential advantages. First, achievement captures variations in the knowledge and skills that schools strive to produce and thus relates the putative outputs of schooling to subsequent economic success. Second, by emphasizing total outcomes of education, such measures incorporate skills from any source – families, schools, and ability. Third, by allowing for differences in performance among students with differing quality of schooling (but possibly the same quantity of schooling), they open the investigation of the importance of different policies designed to affect the quality aspects of schools.

Some recent work has introduced the possibility that noncognitive skills also enter into individual economic outcomes (see importantly Bowles, Gintis, and Osborne (2001); Heckman, Stixrud, and Urzua (2006); Cunha, Heckman, Lochner, and Masterov (2006); Borghans, Duckworth, Heckman, and ter Weel (2008); Almlund, Duckworth, Heckman, and Kautz (2011); Lindqvist and Vestman (2011)). Hanushek and Woessmann (2008) integrate noncognitive skills into the interpretation of general models such as the one described here and show how this may affect the interpretation of the parameter on school attainment and other estimates. While there are no agreed-upon measures of noncognitive skills, at the aggregate level they might well be incorporated in "cultural differences." In any event, the lack of cross-country data on these noncognitive factors precludes incorporating them directly into any statistical analyses.

Importantly, for analyzing differences in growth across countries, the range of ILSAs conducted since the mid-1960s offers the possibility of directly comparing skills across countries. While varying numbers of countries have participated in international testing over time, a substantial fraction of the countries of the world have participated in international testing at one time or another. For a history of the ILSAs relevant for use in growth analysis, see Hanushek and Woessmann (2015a). Unfortunately, until recently, no effort had been made to link the separate tests over time. While there have been varying regional tests, the two main tests linking countries across regions are TIMSS (Trends in International Mathematics and Science Study) and its predecessors and PISA (Programme for International Student Assessment). Given the different test designs, can results be compared across countries? And can the different tests be aggregated? Interestingly, the TIMSS tests with their curricular focus and the PISA tests with their real-world application focus are highly correlated

at the country level. For example, the correlation coefficients between the TIMSS 2003 test of eighth graders and the PISA 2003 test of 15-year-olds across the 19 countries participating in both tests are 0.87 in math and 0.97 in science, and they are 0.86 in both math and science across the 21 countries participating both in the TIMSS 1999 test and the PISA 2000/02 test. There is also a high correlation at the country level between the curriculum-based student tests of TIMSS and the practical adult literacy examinations of IALS (Hanushek & Zhang, 2009). Tests with very different foci and perspectives tend to be highly related, suggesting that they are measuring a common dimension of skills (see also Brown et al., 2007). As discussed below, the consistency lends support to aggregating different student tests for each country in order to develop comparable achievement measures. It is also encouraging when thinking of these tests as identifying fundamental skills included in “knowledge capital.” Comparisons of the difficulty of different ILSA tests across time are readily possible because the United States has participated in all assessments and because there is external information on the absolute level of performance of US students of different ages and across subjects. The United States began consistent testing of a random sample of students around 1970 under the National Assessment of Educational Progress (NAEP). By using the pattern of NAEP scores for the United States over time, it is possible to equate the US performance across each of the international tests. This approach was introduced by Hanushek and Kimko (2000) and was refined by Hanushek and Woessmann (2015a), where a complete discussion of the methodology can be found. In order to get comparable measures of variances across tests, Hanushek and Woessmann (2015a) build on the observed variations of country means for a group of countries that have well-developed and relatively stable educational systems over the time period. To compare long-term growth across countries, Hanushek and Woessmann (2015a) construct a measure of the knowledge capital of each county by averaging scores across all of the subject and year observations of skills found in TIMSS and PISA. While there are some observed score changes within countries, the overall rankings of countries show considerable stability. For the 693 separate test observations through 2003 in the 50 countries that Hanushek and Woessmann (2015a) employ in their growth analysis, 73 percent of the variance falls between countries. The remaining 27 percent includes both changes over time in countries’ scores and random noise from the testing. By averaging, the noise component will be minimized at the cost of obscuring any differences over time for each country.
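A stylized version of this aggregation step, using hypothetical column names and ignoring the NAEP-based equating and the variance adjustment described above, is sketched below: country scores from different tests, years, and subjects are put on a comparable scale and then averaged into a single knowledge-capital measure per country.

```python
# Hypothetical sketch: averaging country-level test observations
# (test x year x subject) into one knowledge-capital measure per country.
import pandas as pd

scores = pd.read_csv("ilsa_country_scores.csv")  # country, test, year, subject, score

# Standardize each test-year-subject observation across countries so that
# different assessments are expressed on a comparable scale.
scores["z"] = scores.groupby(["test", "year", "subject"])["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Average all available observations per country into a single measure.
knowledge_capital = scores.groupby("country")["z"].mean()
print(knowledge_capital.sort_values(ascending=False).head())
```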

Knowledge Capital and Growth

The knowledge capital measure from the ILSAs turns out to be very closely related to economic growth rates in cross-country growth regressions. The central part of the analyses discussed here is the direct estimation of Eq. 1. As described in detail in Hanushek and Woessmann (2015a) and summarized here, long-term economic growth
rates across all countries with relevant data are regressed on an aggregate compilation of ILSA scores, on initial levels of national income, and on other factors. Unlike the case in prior empirical growth studies, the clearest result here is the consistency of the alternative estimates of the cognitive skills-growth relationship – both in terms of quantitative impacts and statistical significance. (For a sense of the instability surrounding early empirical analyses, see the evaluations in Levine and Renelt (1992) and Levine and Zervos (1993).) The remarkable stability of the models in the face of alternative specifications, varying samples, and alternative measures of cognitive skills implies a robustness uncommon to most cross-country growth modeling (Hanushek & Woessmann, 2015a). In terms of previous questions about the fragility of any estimates of years of schooling and growth, these estimates underscore a simple finding that prior results suffered from critical measurement issues. The central finding of the statistical analysis is the importance of cognitive skills in explaining international differences in long-run growth rates (growth over the period 1960–2000). Table 1 from Hanushek and Woessmann (2015a) presents basic results from a 50-country sample. While not the focal point of this analysis, all specifications include GDP per capita in 1960, which provides consistent evidence for conditional convergence, i.e., countries with lower initial income tend to grow faster. The sample includes all countries with both prior ILSA results and reliable historical data on GDP. As a comparison to prior cross-country analyses, the first column of Table 1 presents estimates of a simple growth model with school attainment – the model underlying Fig. 1 above, estimated on the 50-country sample. While this model explains one-quarter of the variance in growth rates, adding cognitive skills increases this to three-quarters of the variance. The aggregate test score from the ILSAs is strongly significant with a magnitude that is unchanged by whether initial school attainment in 1960 is excluded (column 2) or included (column 3).

Table 1 Years of schooling vs. cognitive skills in growth regressions

                                      (1)               (2)                (3)
Cognitive skills                                        2.015*** (10.68)   1.980*** (9.12)
Initial years of schooling (1960)     0.369*** (3.23)                      0.026 (0.34)
Initial GDP per capita (1960)        -0.379*** (4.24)  -0.287*** (9.15)   -0.302*** (5.54)
Constant                              2.785*** (7.41)  -4.827*** (6.00)   -4.737*** (5.54)
No. of countries                      50                50                 50
R2 (adj.)                             0.252             0.733              0.728

Source: Hanushek and Woessmann (2015a). Notes: Dependent variable: average annual growth rate in GDP per capita, 1960–2000. The cognitive skill measure refers to the average score on all international tests 1964–2003 in math and science, primary through end of secondary school. t-statistics in parentheses; statistical significance at *10%, **5%, ***1%
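For readers who want to see the estimating equation behind Table 1 in concrete form, the sketch below sets up the column 3 specification with hypothetical country-level variable names; it is only an illustration of the regression structure, not a reproduction of the published estimates.

```python
# Hypothetical sketch of the Table 1, column 3 specification: long-run growth
# regressed on cognitive skills, initial schooling, and initial income.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("growth_sample.csv")  # illustrative columns: growth_1960_2000,
                                       # cognitive_skills, yrs_school_1960, gdp_pc_1960

model = smf.ols(
    "growth_1960_2000 ~ cognitive_skills + yrs_school_1960 + gdp_pc_1960",
    data=df,
).fit()
print(model.summary())  # compare the pattern of coefficients with Table 1
```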

Fig. 2 Knowledge capital and economic growth rates across countries. Notes: Added-variable plot of a regression of the average annual rate of growth (in percent) of real GDP per capita in 1960– 2000 on average test scores on international student achievement tests, average years of schooling in 1960, and initial level of real GDP per capita in 1960 (mean of unconditional variables added to each axis). (See Table 1, column 3. Source: Hanushek and Woessmann (2015a))

Figure 2 provides a graphical depiction of the basic results (and depictions that are essentially unchanged by the subsequent investigations of alternative specifications). Figure 2 plots the independent impact of knowledge capital on growth, based on column 3 of the table. In contrast to the earlier picture of the impact of school attainment on growth (Fig. 1), countries are now seen lying quite close to the overall line, indicating a very close relationship between educational achievement and economic growth. It is also instructive to plot the impact of school attainment on growth after considering cognitive skills. As seen in Fig. 3, the relationship is now flat: School attainment is not statistically significant in the presence of the direct cognitive skill measure of knowledge capital. This does not change when attainment is measured as the average between 1960 and 2000, rather than at the beginning of the period. The insignificance of school attainment, of course, does not mean that schooling is irrelevant. Measured skills are closely related to schooling, but life cycle skill accumulation depends upon the learning earlier in life. Achievement here is measured at various points during primary and secondary education. Even if tertiary education is simply additive, knowledge at earlier points in education will strongly influence the ultimate skill accumulation when students enter the labor force. As James Heckman and his colleagues have emphasized, there is a dynamic complementarity of investments such that further schooling has a larger impact on skills if it builds on a larger base developed earlier (Cunha & Heckman, 2007). The simple

Fig. 3 Years of schooling and economic growth rates after considering knowledge capital. Notes: Added-variable plot of a regression of the average annual rate of growth (in percent) of real GDP per capita in 1960–2000 on average years of schooling in 1960, average test scores on international student achievement tests, and initial level of real GDP per capita in 1960 (mean of unconditional variables added to each axis). See Table 1, column 3. (Source: Hanushek and Woessmann (2015a))

point is that "skill begets skill through a multiplier process" (Cunha et al., 2006, p. 698), such that additional attainment has a lessened impact if built upon lower basic skills. The insignificance of school attainment does suggest that simply investing in further schooling without ensuring commensurate improvements in cognitive skills does not lead to economic returns. Relatedly, a variety of people place extra weight on tertiary education (e.g., Ehrlich, 2007). However, without building on strong basic skills, such investment appears to have little extra value. Hanushek and Woessmann (2015a) find that, in an analysis of growth across both developed and developing countries, tertiary education has little added value in explaining economic growth after consideration of cognitive skills, with the exception that US investments in higher education have signaled increased growth (see also Hanushek, 2016).

It is useful to consider the magnitude of the estimated impact of knowledge capital on growth. This is amplified below with country-specific estimates of growth impacts. Following a general convention, skill differences are measured in terms of standard deviations, where one standard deviation is, for example, the difference between the median student and the student at the 84th percentile of the international distribution. Almost all of the alternative specifications and modeling approaches suggest that one standard deviation higher cognitive skills of a country's workforce is associated with approximately two percentage points higher annual growth in per capita GDP.

This magnitude is clearly substantial, particularly when compared to growth rates that average between 1.4 and 4.5 percent over the 1960–2000 period across broad regions. On the other hand, it is implausible to expect a country to improve by one standard deviation – bringing, say, Mexico up to the OECD average – over any reasonable time horizon. But it is plausible to think of getting schooling improvements that would lift a country's average by ¼ standard deviation (25 points on a PISA scale). This kind of improvement has, for example, been observed by Mexico, Poland, Germany, and Turkey during the past decade and by Finland over the two to three decades before (see Organisation for Economic Co-operation and Development, 2013). The economic impact of such differences is considered below.

Perhaps a leading competitor as a fundamental explanation of growth differences is the role of societal institutions – including the basic economic and legal structure of nations. This perspective, pursued importantly by Daron Acemoglu and his collaborators, links growth to some of the overall policies of countries. See, for example, the overview and discussion in Acemoglu, Johnson, and Robinson (2005) and Acemoglu and Robinson (2012). This perspective is actually quite complementary to the work described here. Hanushek and Woessmann (2015a) show that knowledge capital has a statistically significant and strong (albeit somewhat smaller) impact on growth when explicit institutional measures are included.

A final issue addressed by Hanushek and Woessmann (2015a) is that the simple average of skills does not adequately reflect the policy options typically facing a nation. Specifically, one could institute policies chiefly directed to the lower end of the cognitive distribution, such as the Education for All initiative, or one could aim more at the top end, such as the focused technological colleges of India. It is possible to go beyond simple mean differences in scores and provide estimates of how growth is affected by the distribution of skills within countries and how it might interact with the nation's technology. These estimates in Hanushek and Woessmann (2015a) suggest that improving both ends of the distribution is beneficial and complementary, i.e., the importance of highly skilled people is even larger with a more skilled labor force. Perhaps surprisingly, the highly skilled are even more important in developing countries that have scope for imitation than in developed countries that are innovating. In other words, both providing broad basic education – education for all – and pushing significant numbers to very high achievement levels have economic payoffs.
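A back-of-the-envelope calculation, based only on the two-percentage-point estimate quoted above and not on the full simulations in Hanushek and Woessmann (2015a), illustrates both the size of the payoff and the patience it requires:

```python
# Sketch: a 1/4 standard-deviation improvement (about 25 PISA points), combined
# with the roughly 2 percentage-points-per-SD estimate quoted above, implies
# about 0.5 percentage points higher annual growth.
extra_growth = 0.25 * 0.02  # 0.005, i.e., 0.5 percentage points per year

for years in (10, 20, 40):
    gain = (1 + extra_growth) ** years - 1
    print(f"After {years} years: GDP about {gain:.0%} higher than otherwise")
# After 40 years GDP would be roughly 22% higher, a large effect that
# nonetheless accumulates only slowly.
```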

Causality

The fundamental question is whether the tight relationship between cognitive skills and economic growth should be interpreted as a causal one that can support direct policy actions. In other words, if achievement were raised, should one really expect growth rates to go up by a commensurate amount? Hanushek and Woessmann (2012, 2015a, b) devote considerable attention to causality, and it is valuable to consider the strengths and weaknesses of those analyses. While the details of the
analyses become complicated, it is possible to bring out the ideas from these discussions. Work on differences in growth among countries, while extensive over the past two decades, has been plagued by legitimate questions about whether any truly causal effects have been identified, or whether the estimated statistical analyses simply pick up a correlation that emerges for other reasons. Knowing that the relationship is causal, and not simply a byproduct of some other factors, is clearly very important from a policy standpoint. Policymaking requires confidence that by improving academic achievement, countries will bring about a corresponding improvement in the long-run growth rate. If the relationship between test scores and growth rates simply reflects other factors that are correlated with both test scores and growth rates, policies designed to raise test scores may have little or no impact on the economy. The early studies that found positive effects of years of schooling on economic growth may well have been suffering from reverse causality; they correctly identified a relationship between improved growth and more schooling but incorrectly saw the latter as the cause and not the effect (see, e.g., Bils & Klenow, 2000). In this case, the data may have reflected the fact that as a country gets richer, it tends to buy more of many things, including more years of schooling for its population. There is less reason to think that higher student achievement is caused by economic growth. For one thing, scholars have found little impact of additional education spending on achievement outcomes, so it is unlikely that the relationship comes from growth-induced resources lifting student achievement (see the review in Hanushek and Woessmann (2011)). Still, it remains difficult to develop conclusive tests of causality with the limited sample of countries included in the analysis. The best way to increase confidence that higher student achievement causes economic growth is to consider explicitly alternative explanations of the observed achievement-growth relationship to determine whether plausible alternatives that could confound the results can be ruled out. No single approach can address all of the important concerns. But a combination of approaches – if together they provide support for a causal relationship between achievement and growth – can offer some assurance that the potentially problematic issues are not affecting the results. First, other factors besides cognitive skills may be responsible for countries’ economic growth. In an extensive investigation of alternative model specifications, Hanushek and Woessmann (2015a) employ different measures of cognitive skills, various groupings of countries (including some that eliminate regional differences), and specific sub-periods of economic growth. But the results show a consistency in the alternative estimates, in both quantitative impacts and statistical significance, that is uncommon in cross-country growth modeling. Nor do measures of geographical location, political stability, capital stock, and population growth significantly affect the estimated impact of cognitive skills. These specification tests rule out some basic problems attributable to omitted causal factors that have been noted in prior growth work.


Second, the most obvious reverse causality issues arise because the analysis reported above from Hanushek and Woessmann (2015a) relates growth rates over the period 1960–2000 to test scores for roughly the same period. To address this directly, it is possible to separate the timing of the analysis and to estimate the effect of test scores through 1984 on economic growth in the period since 1985 (until 2009). This analysis capitalizes directly on the long history of ILSAs. In this analysis, available for a sample of 25 countries only, test scores strictly pre-date the growth period, making it clear that increased growth could not be causing the higher test scores. This estimation shows a positive effect of early test scores on subsequent growth rates that is almost twice as large as that displayed above. Indeed, this fact itself may be significant, because it is consistent with the possibility that skills have become even more important for the economy in recent periods. Third, even if reverse causality were not an issue, it remains unclear that the important international differences in test scores reflect school policies. After all, differences in achievement may arise because of health and nutrition differences in the population or simply because of cultural differences regarding learning and testing. To address this, it is possible to focus attention on just the variations in achievement that arise directly from institutional characteristics of each country’s school system (exit examinations, autonomy, relative teacher salaries, and private schooling). The formal approach is called “instrumental variables.” (In order for this to be a valid approach, it must be the case that the institutions are not themselves related to differences in growth beyond their relation with test scores. For a fuller discussion, see Hanushek and Woessmann (2012).) When the analysis is limited in this way, the estimation of the growth relationship yields essentially the same results as previously presented. The similarity of the results supports the causal interpretation of the effect of cognitive skills as well as the conclusion that schooling policies can have direct economic returns. Fourth, a possible alternative to the conclusion that high achievement drives economic growth, not eliminated by the prior analysis, is that countries with good economies also have good school systems. In this case, achievement is simply a reflection of other important aspects of the economy and not the driving force in growth. One simple way to test this possibility is to consider the implications of differences in measured skills within a single economy, thus eliminating institutional or cultural factors that may make the economies of different countries grow faster. This can readily be done by comparing immigrants to the United States who have been educated in their home countries with immigrants educated just in the United States. Since the two groups are within the single labor market of the United States, any differences in labor market returns associated with cognitive skills cannot arise from differences in the economy or culture of their home country. This comparison finds that the cognitive skills seen in the immigrant’s home country lead to higher incomes, but only if the immigrant was in fact educated in the home country. Immigrants from the same home country schooled in the United States see no economic return to home country test scores – a finding that pinpoints the value of better schools.
These results hold when Mexicans (the largest US immigrant group) are excluded and when only
immigrants from English-speaking countries are included. While not free from problems, this comparative analysis rules out the possibility that test scores simply reflect cultural factors or economic institutions of the home country. It also lends further support to the potential role of schools in changing the cognitive skills of citizens in economically meaningful ways. Finally, perhaps the toughest test of causality is relating changes in test scores over time to changes in growth rates. If test score improvements actually increase growth rates, it should show up in such a relationship. For those countries that have participated in testing at different points over the past half century, it is possible to observe whether students seem to be getting better or worse over time. This approach implicitly eliminates country-specific economic and cultural factors because it looks at what happens over time within each country. For 12 OECD countries that have participated over a long time, the magnitude of trends in educational performance can be related to the magnitude of trends in growth rates over time. This investigation provides more evidence of the causal influence of cognitive skills (although the small number of countries is obviously problematic). The gains in test scores over time are very closely related to the gains in growth rates over time. Like the other approaches, this analysis must presume that the pattern of achievement changes has been occurring over a long time, because it is not the achievement of school children but the skills of workers that count. Nonetheless, the consistency of the patterns is striking, as is the similarity in magnitude of the estimates to the basic growth models. Again, each approach to determining causation is subject to its own uncertainty. Nonetheless, the combined evidence consistently points to the conclusion that differences in cognitive skills lead to significant differences in economic growth. Moreover, even if issues related to omitted factors or reverse causation remain, it seems very unlikely that these cause all of the estimated effects. Since the causality tests concentrate on the impact of schools, the evidence suggests that school policy, if effective in raising cognitive skills, can be an important force in economic development. While other factors – culture, health, and so forth – may affect the level of cognitive skills in an economy, schools clearly contribute to the development of human capital. More years of schooling in a system that is not well designed to enhance learning, however, will have little effect.
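For readers who want to see the mechanics of the instrumental-variables step described above, the following is a minimal synthetic-data sketch of two-stage least squares; the variable names, data, and coefficients are hypothetical and this is not the authors' actual estimation.

```python
import numpy as np

# Minimal sketch of the instrumental-variables logic described above, on
# synthetic data. Institutional features of school systems (e.g., exit exams,
# school autonomy) serve as instruments for test scores in a growth regression.
rng = np.random.default_rng(42)
n = 50                                    # hypothetical number of countries

exit_exams = rng.binomial(1, 0.5, n)      # instruments: institutional features
autonomy = rng.uniform(0, 1, n)
confounder = rng.normal(0, 1, n)          # unobserved factor affecting scores and growth

scores = 0.8 * exit_exams + 0.6 * autonomy + 0.5 * confounder + rng.normal(0, 0.3, n)
growth = 2.0 * scores + 0.7 * confounder + rng.normal(0, 0.3, n)

const = np.ones(n)
X = np.column_stack([const, scores])                 # constant + endogenous regressor
Z = np.column_stack([const, exit_exams, autonomy])   # constant + instruments

# Two-stage least squares: project X on Z, then regress growth on the fitted values
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_iv = np.linalg.lstsq(X_hat, growth, rcond=None)[0]
beta_ols = np.linalg.lstsq(X, growth, rcond=None)[0]
print(f"OLS effect of scores: {beta_ols[1]:.2f}   IV effect: {beta_iv[1]:.2f}")
```

With instruments that are unrelated to the confounder, the IV estimate recovers the effect of scores on growth, while the OLS estimate also absorbs the confounded variation.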

The Gains from Universal Basic Skills

A simple way to assess the importance of knowledge capital is to look at the potential gains that improved knowledge capital would imply for nations if historical growth patterns continue to hold. One simple example is a version of the Sustainable Development Goal related to education: equitable provision of quality education at least through primary and lower secondary school. Projections that can be developed from the PISA testing provide an image of the potential gains around the world from improved education.


The Global Challenge

Looking at the 15-year-old population of different countries (which matches the age of testing in PISA), it is possible to understand both variations in school attainment and in quality for a large number of countries of the world. While much of the discussion of education quality focuses on average test scores for countries, it is perhaps better to focus on lower achievement in order to relate such analysis to the minimal goals outlined in the SDGs. There is reason to be concerned about the number of countries that have yet to ensure broad access to and enrollment in secondary schools. Figure 4 displays net enrollment rates at the tested age for each of the 76 countries in the test sample. (See Hanushek and Woessmann (2015b) for a description of the sample of countries and of the underlying data.) While 44 of the countries have over 95 percent participation of their 15-year-olds in 2012, the participation rates begin to fall significantly after this point. In the bottom 17 countries, enrollment rates are less than 80 percent. The meaning of quality education, of course, is subject to considerable discussion and debate.

Fig. 4 Secondary school enrolment rates. Notes: PISA participants: share of 15-year-olds enrolled in school; TIMSS (non-PISA) participants: net enrolment ratio in secondary education (% of relevant group). (Source: Hanushek and Woessmann (2015b))


A convenient starting point to illustrate the economic impact of improvement is to define minimal performance as satisfying Level 1 on the PISA tests. The description of this performance level (for math) is:

At Level 1, students can answer questions involving familiar contexts where all relevant information is present and the questions are clearly defined. They are able to identify information and to carry out routine procedures according to direct instructions in explicit situations. They can perform actions that are almost always obvious and follow immediately from the given stimuli. (OECD, 2013)

This skill level might be interpreted as the minimal level for somebody to participate effectively in the modern global economy. Figure 5 presents for the 76 countries the share of youth in school falling below the level indicating minimal skills – below 420 on the mathematics and science assessments on the PISA tests. Figure 5 is of course closely related to what would be seen for average country scores, though in fact the rankings are somewhat different.

Fig. 5 Share of students not attaining basic skills. Notes: Share of students performing below 420 points on international student achievement test. Average of mathematics and science. PISA participants: based on PISA 2012 micro data; TIMSS (non-PISA) participants: based on eighth-grade TIMSS 2011 micro data, transformed to PISA scale. (Source: Hanushek and Woessmann (2015b))


Nine of the 76 countries have more than two-thirds of their students failing to meet this level of minimum skills (Ghana, Honduras, South Africa, Morocco, Indonesia, Peru, Qatar, Colombia, and Botswana). Hong Kong, Estonia, Korea, and Singapore lead at the other end of the distribution, but even these countries face the challenge of ensuring that all youth attain minimal skill levels. It is important to recognize that the richest countries of the world also have significant populations without minimal skills: Luxemburg (25 percent), Norway (22 percent), United States (24 percent), and Switzerland (14 percent). In other words, the development goal is significant and real for all the countries of the world.
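As an illustration of how the indicator in Fig. 5 is constructed, the short snippet below computes the share of students scoring below the 420-point threshold; the student scores here are simulated and purely hypothetical, whereas the published figures rest on PISA 2012 and TIMSS 2011 micro data.

```python
import numpy as np

# Illustration of the "share below basic skills" indicator used in Fig. 5,
# computed on hypothetical micro data (the real figures use PISA 2012 / TIMSS 2011
# student records, with TIMSS scores transformed to the PISA scale).
rng = np.random.default_rng(0)
math = rng.normal(480, 95, size=5000)       # hypothetical student scores
science = rng.normal(485, 95, size=5000)

avg_score = (math + science) / 2            # average of mathematics and science
share_below_basic = np.mean(avg_score < 420)
print(f"Share of students below 420 points: {share_below_basic:.1%}")
```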

Economic Impacts of Universal Basic Skills

It is possible to use the previously displayed growth model (Table 1) to estimate the economic impact of improving the education picture seen in Figs. 4 and 5. Specifically, the growth relationships found in Table 1 provide a means of simulating how growth and thus future GDP would be altered if a country altered its knowledge capital. Schooling policies obviously take time to have their effect on the labor force and on future growth, but various school improvements can be traced out – assuming that the past relationships hold into the future. Hanushek and Woessmann (2015b) provide alternative projections based on reforms beginning in 2015; these include meeting the SDG #4 goal. The education development goal is framed as the standard that should be met by 2030, making it natural to assume linear improvement from today’s schooling situation to attainment of the goal in 15 years. But of course the labor force itself will only become more skilled as increasing numbers of new, better trained people enter the labor market and replace the less skilled who retire. If a typical worker remains in the labor force for 40 years, the labor force will not be made up of fully skilled workers until 55 years have passed (15 years of reform and 40 years of replacing less skilled workers as they retire). They calculate the growth rate of the economy (according to the estimate of 1.98 percent higher annual growth rate per standard deviation in educational achievement; see column 3 of Table 1) each year into the future based on the average skill of workers (which changes as new, more skilled workers enter). The expected level of gain in GDP with an improved workforce comes from comparing GDP with a more skilled labor force to that with the existing workforce from 2015 until 2095. The growth of the economy with the current level of skills is projected to be 1.5 percent, the rough average of OECD growth over the past two decades. The projection is carried out for 80 years, which corresponds to the life expectancy of somebody born in 2015. Future gains in GDP are discounted from the present with a 3 percent discount rate. The initial GDP refers to 2015 estimates based on purchasing-power-parity (PPP) calculations in current international dollars; see International Monetary Fund (2014). The resulting present value of additions to GDP is thus directly comparable to the current levels of GDP. It is also possible to compare the gains to the discounted value of projected future GDP without reform to arrive at the average increase in GDP over the 80 years.
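The mechanics of this projection can be illustrated with a short calculation. The sketch below is a stylized re-implementation of the logic just described, not the authors' actual procedure: a hypothetical achievement gain phases in linearly over 15 years, the workforce turns over across 40-year careers, each standard deviation of average workforce skill adds 1.98 percentage points to annual growth, and the resulting GDP gains are discounted at 3 percent over an 80-year horizon. The function returns the two quantities reported in Table 2, the present value of the gains relative to current GDP and relative to discounted future GDP.

```python
import numpy as np

# Stylized projection of GDP gains from a skill reform, following the logic
# described in the text. Parameter values are taken from the chapter where
# stated; the cohort accounting is a simplification of Hanushek and
# Woessmann (2015b), not their actual procedure.
GROWTH_PER_SD = 0.0198   # extra annual growth per SD of achievement (Table 1, col. 3)
BASELINE_GROWTH = 0.015  # assumed growth without reform
DISCOUNT = 0.03          # discount rate applied to future GDP
REFORM_YEARS = 15        # linear phase-in of the achievement gain
WORK_LIFE = 40           # years a cohort stays in the labor force
HORIZON = 80             # projection period (2015-2095)

def project_gains(skill_gain_sd):
    """PV of GDP gains from raising achievement by `skill_gain_sd` standard
    deviations, relative to current GDP and to discounted future GDP."""
    def cohort_gain(t):
        # Achievement gain of the cohort entering the labor force in year t
        if t <= 0:
            return 0.0
        return skill_gain_sd * min(t / REFORM_YEARS, 1.0)

    gdp_reform, gdp_base = 1.0, 1.0      # normalize current GDP to 1
    pv_gain, pv_base = 0.0, 0.0
    for year in range(1, HORIZON + 1):
        # Average skill gain of the workforce: the WORK_LIFE most recent cohorts
        workforce_gain = np.mean([cohort_gain(year - a) for a in range(WORK_LIFE)])
        gdp_reform *= 1 + BASELINE_GROWTH + GROWTH_PER_SD * workforce_gain
        gdp_base *= 1 + BASELINE_GROWTH
        discount_factor = (1 + DISCOUNT) ** year
        pv_gain += (gdp_reform - gdp_base) / discount_factor
        pv_base += gdp_base / discount_factor
    return pv_gain, pv_gain / pv_base

if __name__ == "__main__":
    pv, share = project_gains(0.25)   # hypothetical 1/4 SD (25 PISA points) improvement
    print(f"PV of gains: {pv:.0%} of current GDP; {share:.1%} of discounted future GDP")
```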


Table 2 Gains from achieving universal basic skills

                                  In % of current GDP    In % of discounted future GDP
Lower middle-income countries     1302%                  27.9%
Upper middle-income countries     731%                   15.6%
High-income non-OECD countries    473%                   10.1%
High-income OECD countries        162%                   3.5%

Source: Hanushek and Woessmann (2015b). Notes: Discounted value of future increases in GDP until 2095 due to a reform that achieves full participation in secondary school and brings each student to a minimum of 420 PISA points. Simple averages of countries in each income group.

Table 2 summarizes the projected gains for each of the groupings of countries under an assumption that each country achieves universal basic skills of its students by 2030. Unsurprisingly, the lowest-income countries of the sample would show by far the largest gains. The simple estimates for the eight lower middle-income countries indicate a present value of gains averaging 13 times the current GDP of these countries. Translated into a percentage of future GDP, this implies a GDP that is 28 percent higher on average every year for the next 80 years. By the end of the projection period in 2095, GDP with school improvement would average some 140 percent greater than would be expected with the current skills of the labor force. Increases of this magnitude are, of course, unlikely, because the projected gains in achievement over the next decade and a half are outside any real expectations. Ghana and Honduras, for example, would require an increase in achievement of over one standard deviation during this 15-year period. Nothing like that has ever been seen. But the calculations do show the value of improvement and suggest the lengths to which a country should be willing to go to improve its schools.

Figure 6 compares the gains from meeting universal basic skills (in present value terms) to current GDP. Perhaps the most interesting part of the figure is the right-hand side. It shows that among the high-income non-OECD countries, the impact on the oil-producing countries is particularly dramatic. Improved minimal skills among the populations of Oman, Qatar, and Saudi Arabia imply gains exceeding eight times current GDP for these countries, and Bahrain follows closely. If the price of oil falls, say through new technologies, these countries will have to rely on the skills of their populations – and the data suggest there is substantial room for improvement. Equally interesting are the high-income OECD countries, which typically don’t figure in discussions of development goals. For 8 of these 31 countries, the present value of GDP gains from meeting the minimal skills goal would be more than twice the size of their current GDP.


Fig. 6 Effect on GDP if all children acquire basic skills (in % of current GDP). Notes: Discounted value of future increases in GDP until 2095, simulated for each country based on the growth models in Table 1, column 3. The simulations assume a 15-year school reform that achieves full participation in secondary school and brings each student to a minimum of 420 PISA points; gains are expressed as a percentage of current GDP. Value is 3881% for Ghana, 2016% for Honduras, 2624% for South Africa, 1427% for Oman, and 1029% for Qatar. (Source: Hanushek and Woessmann (2015b))

In order, the OECD countries with gains exceeding twice current GDP are Chile, Israel, the Slovak Republic, Greece, Italy, France, Sweden, and Luxemburg. The average gain across the high-income OECD countries is 162 percent of current GDP. This implies a GDP that is on average 3.5 percent higher than would be expected with no improvement in the schools (see Fig. 7). Almost all of the gain comes from improving achievement at the bottom end, since enrollment in these countries is near universal. The lowest secondary enrollment rates among high-income OECD countries are found in Chile (92 percent), Italy (94 percent), Greece (95 percent), and France (95 percent). Interestingly, particularly given the international debates on educational goals, the impact of improving student quality is almost always larger than the impact of improving access at current quality levels of each nation. The alternative scenarios are found originally in Hanushek and Woessmann (2015b).


Fig. 7 Effect on GDP if all children acquire basic skills (in % of discounted future GDP). Notes: Discounted value of future increases in GDP until 2095, simulated for each country based on the growth models in Table 1, column 3. The simulations assume a 15-year school reform that achieves full participation in secondary school and brings each student to a minimum of 420 PISA points; gains are expressed as a percentage of discounted future GDP. Value is 83.0% for Ghana, 56.1% for South Africa, and 30.5% for Oman. (Source: Hanushek and Woessmann (2015b))

Figure 8 shows the economic gains (in terms of percent of future GDP) for three alternative improvements: (1) achieving basic skills just for those currently in school; (2) expanding access to school for all with current quality levels; and (3) achieving universal basic skills. Even for the lower middle-income countries, improving quality completely dominates just having full access. Of course, doing both is significantly better than pursuing either partial option.

Conclusions

The implicit message of this chapter is that ILSAs provide particularly relevant data for considering economic development issues. ILSAs permit direct comparisons of the skills of the citizens across a broad range of countries. Development goals in education are generally easy to motivate both within individual countries and within international development agencies because there is ready acceptance of the idea that nations’ growth is directly related to human capital – the skills of the populations. The disappointments have come largely from an undue focus on school attainment as opposed to learning.


[Fig. 8 chart: Average Gains from Improving Skills (average % increase in future GDP), comparing three scenarios (basic skills at current access; full access at current quality; universal basic skills) across lower middle-income, upper middle-income, high-income non-OECD, and high-income OECD countries]
Fig. 8 Average gains for countries at different income levels from improving skills. Notes: Average improvements across countries by income group of the simulations underlying Fig. 7, but done separately for three reforms: 1. bringing all students currently in school to basic skill levels; 2. ensuring access for all children at the current quality level of the schools; and 3. bringing all students up to basic skill levels. (Source: Hanushek and Woessmann (2020))

Historically, because progress toward educational goals was assessed by measured years of school attainment, success was more apparent than real. Over the past two decades, school attainment in developing countries has grown significantly, but learning has not grown commensurately. The results reported here build on prior work that focuses on the gains from learning and that supports the finding of a strong, causal relationship between cognitive skills and economic growth. The Sustainable Development Goals indicate that a fundamental education goal for all nations can be succinctly stated: all youth should achieve minimal skills. We suggest that a workable definition of minimal skills in today’s economically competitive world is fully mastering Level 1 skills on the PISA tests, which is equivalent to a mathematics score of 420. Importantly, this quality aspect of the education goal can be readily measured and tracked, thus providing a similar impetus to development as the prior focus on attainment did. The history of economic growth makes understanding the economic implications of different educational policy outcomes straightforward. Three options that represent much of current policy discussion can be readily compared: bringing all current students up to minimal skills; universal access to schools at current quality; and universal access with minimal skills. Universal access at current quality yields some economic gains, particularly in the lower-income countries. But improving the quality of schools to raise achievement for current students has a much larger economic impact. Meeting the goal of universal access with basic skills for all has an even greater impact. For lower middle-income
countries, the discounted present value of future gains would be 13 times the current GDP and would average out to a 28 percent higher GDP over the next 80 years. For upper middle-income countries, it would average out to a 16 percent higher GDP. The goal of universal minimal skills also has meaning for high-income countries. Driven in part by oil-producing countries that face some schooling challenges, the high-income non-OECD countries as a group would see an average of 10 percent higher future GDP – almost five times the value of current GDP – if they met this goal. But even the high-income OECD countries would gain significantly from bringing all portions of the population up to basic skills; for this group, future GDP would be 3.5 percent higher than it would be otherwise. Improving the skills of the population clearly has substantial implications for economic well-being, in particular when improvements that accrue in the more distant future are also considered.

Cross-References
▶ Educational Accountability and the Role of International Large-Scale Assessments
▶ Methods of Causal Analysis with ILSA Data
▶ The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries
▶ Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness

References

Acemoglu, D. (2009). Introduction to modern economic growth. Princeton University Press.
Acemoglu, D., Johnson, S., & Robinson, J. A. (2005). Institutions as a fundamental cause of long-run growth. In P. Aghion & S. N. Durlauf (Eds.), Handbook of economic growth (pp. 385–472). North Holland.
Acemoglu, D., & Robinson, J. A. (2012). Why nations fail: The origins of power, prosperity, and poverty. Crown Publishers.
Aghion, P., & Howitt, P. (1998). Endogenous growth theory. MIT Press.
Aghion, P., & Howitt, P. (2009). The economics of growth. MIT Press.
Almlund, M., Duckworth, A. L., Heckman, J., & Kautz, T. (2011). Personality psychology and economics. In E. A. Hanushek, S. Machin, & L. Woessmann (Eds.), Handbook of the economics of education, Vol. 4 (pp. 1–181). North Holland.
Barro, R. J. (1991). Economic growth in a cross section of countries. Quarterly Journal of Economics, 106(2), 407–443.
Barro, R. J. (1997). Determinants of economic growth: A cross-country empirical study. MIT Press.
Barro, R. J., & Lee, J.-W. (1993). International comparisons of educational attainment. Journal of Monetary Economics, 32(3), 363–394.
Barro, R. J., & Lee, J.-W. (2001). International data on educational attainment: Updates and implications. Oxford Economic Papers, 53(3), 541–563.
Barro, R. J., & Lee, J.-W. (2013). A new data set of educational attainment in the world, 1950–2010. Journal of Development Economics, 104, 184–198.
Barro, R. J., & Sala-i-Martin, X. (2004). Economic growth (2nd ed.). The MIT Press.
Becker, G. S. (1964). Human capital: A theoretical and empirical analysis, with special reference to education. National Bureau of Economic Research.
Benhabib, J., & Spiegel, M. M. (1994). The role of human capital in economic development: Evidence from aggregate cross-country data. Journal of Monetary Economics, 34(2), 143–174.
Benhabib, J., & Spiegel, M. M. (2005). Human capital and technology diffusion. In P. Aghion & S. N. Durlauf (Eds.), Handbook of economic growth (pp. 935–966). North Holland.
Bils, M., & Klenow, P. J. (2000). Does schooling cause growth? American Economic Review, 90(5), 1160–1183.
Borghans, L., Duckworth, A. L., Heckman, J. J., & ter Weel, B. (2008). The economics and psychology of personality traits. Journal of Human Resources, 43(4), 972–1059.
Bowles, S., Gintis, H., & Osborne, M. (2001). The determinants of earnings: A behavioral approach. Journal of Economic Literature, 39(4), 1137–1176.
Brown, G., Micklewright, J., Schnepf, S. V., & Waldmann, R. (2007). International surveys of educational achievement: How robust are the findings? Journal of the Royal Statistical Society A, 170(3), 623–646.
Cunha, F., & Heckman, J. J. (2007). The technology of skill formation. American Economic Review, 97(2), 31–47.
Cunha, F., Heckman, J. J., Lochner, L., & Masterov, D. V. (2006). Interpreting the evidence on life cycle skill formation. In E. A. Hanushek & F. Welch (Eds.), Handbook of the economics of education (pp. 697–812). Elsevier.
Ehrlich, I. (2007, January). The mystery of human capital as engine of growth, or why the US became the economic superpower in the 20th century (NBER Working Paper 12868). National Bureau of Economic Research.
Hanushek, E. A. (2002). Publicly provided education. In A. J. Auerbach & M. Feldstein (Eds.), Handbook of public economics, Vol. 4 (pp. 2045–2141). North Holland.
Hanushek, E. A. (2016). Will more higher education improve economic growth? Oxford Review of Economic Policy, 32(4), 538–552.
Hanushek, E. A., & Kimko, D. D. (2000). Schooling, labor force quality, and the growth of nations. American Economic Review, 90(5), 1184–1208.
Hanushek, E. A., & Woessmann, L. (2008). The role of cognitive skills in economic development. Journal of Economic Literature, 46(3), 607–668.
Hanushek, E. A., & Woessmann, L. (2011). The economics of international differences in educational achievement. In E. A. Hanushek, S. Machin, & L. Woessmann (Eds.), Handbook of the economics of education, Vol. 3 (pp. 89–200). North Holland.
Hanushek, E. A., & Woessmann, L. (2012). Do better schools lead to more growth? Cognitive skills, economic outcomes, and causation. Journal of Economic Growth, 17(4), 267–321.
Hanushek, E. A., & Woessmann, L. (2015a). The knowledge capital of nations: Education and the economics of growth. MIT Press.
Hanushek, E. A., & Woessmann, L. (2015b). Universal basic skills: What countries stand to gain. Organisation for Economic Co-operation and Development.
Hanushek, E. A., & Woessmann, L. (2020). Education, knowledge capital, and economic growth. In S. Bradley & C. Green (Eds.), The economics of education: A comprehensive overview (pp. 171–182). Academic Press.
Hanushek, E. A., & Zhang, L. (2009). Quality-consistent estimates of international schooling and skill gradients. Journal of Human Capital, 3(2), 107–143.
Heckman, J. J., Stixrud, J., & Urzua, S. (2006). The effects of cognitive and noncognitive abilities on labor market outcomes and social behavior. Journal of Labor Economics, 24(3), 411–482.
International Monetary Fund. (2014). World economic outlook, October 2014. International Monetary Fund.
Jones, C. I., & Vollrath, D. (2013). Introduction to economic growth (3rd ed.). W.W. Norton and Company.
Levine, R., & Renelt, D. (1992). A sensitivity analysis of cross-country growth regressions. American Economic Review, 82(4), 942–963.
Levine, R., & Zervos, S. J. (1993). What we have learned about policy and growth from cross-country regressions. American Economic Review, 83(2), 426–430.
Lindqvist, E., & Vestman, R. (2011). The labor market returns to cognitive and noncognitive ability: Evidence from the Swedish enlistment. American Economic Journal: Applied Economics, 3(1), 101–128.
Lucas, R. E., Jr. (1988). On the mechanics of economic development. Journal of Monetary Economics, 22(1), 3–42.
Mankiw, N. G., Romer, D., & Weil, D. (1992). A contribution to the empirics of economic growth. Quarterly Journal of Economics, 107(2), 407–437.
Mincer, J. (1974). Schooling, experience, and earnings. NBER.
Nelson, R. R., & Phelps, E. (1966). Investment in humans, technology diffusion and economic growth. American Economic Review, 56(2), 69–75.
OECD. (2013). PISA 2012 results: What students know and can do – Student performance in mathematics, reading and science (Volume I). Organisation for Economic Co-operation and Development.
Organisation for Economic Co-operation and Development. (2013). PISA 2012 results: What students know and can do – Student performance in mathematics, reading and science (Volume I). OECD.
Pritchett, L. (2001). Where has all the education gone? World Bank Economic Review, 15(3), 367–391.
Pritchett, L. (2006). Does learning to add up add up? The returns to schooling in aggregate data. In E. A. Hanushek & F. Welch (Eds.), Handbook of the economics of education (pp. 635–695). North Holland.
Romer, P. (1990). Endogenous technological change. Journal of Political Economy, 99(5, pt. II), S71–S102.
Sala-i-Martin, X., Doppelhofer, G., & Miller, R. I. (2004). Determinants of long-term growth: A Bayesian averaging of classical estimates (BACE) approach. American Economic Review, 94(4), 813–835.
Schultz, T. W. (1961). Investment in human capital. American Economic Review, 51(1), 1–17.
Schumpeter, J. A. (1912/2006). Theorie der wirtschaftlichen Entwicklung [The theory of economic development]. Duncker & Humblot.
Solow, R. M. (1956). A contribution to the theory of economic growth. Quarterly Journal of Economics, 70(1), 65–94.
Welch, F. (1970). Education in production. Journal of Political Economy, 78(1), 35–59.

4 Reasons for Participation in International Large-Scale Assessments

Ji Liu and Gita Steiner-Khamsi

Contents
Introduction
The Appeal of International Large-Scale Student Assessments
Creating a Demand for ILSAs
Conclusion
References

Abstract

International large-scale assessments (ILSAs) are attracting global attention, but examining the reasons for this sharp increase in demand is an under-explored area. Drawing on key concepts in policy borrowing research, this chapter synthesizes key explanations of why ILSAs are so attractive to policy makers. From a demand-side perspective, ILSAs captivate countries through the following: (i) the comparative advantage of numbers over narratives, (ii) the quest for a credible source of information in an era marked by a surplus of evidence, (iii) the weak link between national curriculum and some ILSAs, and (iv) transnational accreditation of public education. On the supply side, two brand-new developments are identified: (i) the preoccupation with linking test accountability to “education is in crisis” and (ii) new ILSA derivative tools that urge countries to reconsider partial and non-participation. In the age of ILSA expansion and test-based accountability, countries are saturated in a surplus of assessments; yet, the predominant policy advice remains singular, and countries are discouraged from alternative non-standardized paths for measuring learning.


Keywords

ILSAs · Participation · Policy borrowing · Learning · Human Capital Index

There seems to be agreement that participation in international large-scale assessments (ILSAs) generates, depending on the results, either reform pressure or reform alleviation. In other words, ILSA participation is likely to have an impact on agenda setting, because governments tend to use results from the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), or other ILSAs politically to draw public attention, mobilize financial resources, or build a coalition across party lines for the sake of educational reform. In contrast, the impact of participation in ILSAs on actual policy formulation is less clear. How do policy makers use ILSAs, what exactly do they learn from ILSA results, and – judging from the ever-growing number of governments participating round after round in various ILSAs – why are ILSAs apparently so attractive to policy makers, and continuously so? Put differently: why do governments keep allocating massive amounts of financial and human resources to have their students assessed, round after round, test after test, and school grade after school grade by a growing number of internationally benchmarked standardized tests? In this chapter, we explore the reasons why governments participate in ILSAs. On the demand side, we identify a few features of ILSAs that make them attractive to policy makers. On the supply side, we discuss a few recent developments that help explain how international organizations ensure the continuous interest in ILSA participation and thereby sustain the demand for international standardized testing. Both approaches to our inquiry – demand and supply side reasons for ILSA participation – rely on key concepts of policy borrowing research and, as a corollary, draw inspiration from sociological systems theory.

Introduction

Arguably, the interpretive framework of policy borrowing research lends itself to explaining the growing influence of global actors (e.g., the OECD or World Bank), global monitoring instruments (e.g., ILSAs, Education at a Glance, Global Education Monitoring Report), and global education policies (e.g., test-based accountability in school reforms, competency-based curriculum reform, accreditation policies in higher education) on national education reforms. There are four concepts, in particular, that are relevant for understanding the appeal of ILSAs: reception, translation, externalization, and reference societies. These concepts are inspired by sociological systems theory (Luhmann, 1990).


Our group of policy borrowing researchers applies a systems theoretical perspective to understand why and when local policy actors externalize (Schriewer, 1990) and how they translate these external impulses for their own agenda setting and policy formulation. In other words, the commonality among the system theoretical scholars is the analytical focus on the local policy context. Rather than examining what international organizations such as the OECD, global education policies, or international “best practices” do to national educational systems, the system theoretical focus pursues the inverse approach: how do national policy actors (politically) utilize or instrumentalize the transnational or the global dimension to induce change in the local policy context. According to sociological systems theory, self-referential systems produce their own causes and determine what counts as internal and external causation. As with other systems theoretical notions, the ecological perspective is manifest in the concept of externalization. Since systems are considered to be operatively closed but cognitively open, everything around them is environment and therefore observable (see Steiner-Khamsi, 2021). This applies specifically to organizational systems such as national education systems that, at particular moments, receive and translate codes from other subsystems, other national education systems, or from the “world” into their own national context or own “sociologic” (Schriewer & Martinez, 2004, p. 33). External causes – such as the move to outcomes-based governance and knowledge-based regulation in new public management – are effective only if they resonate internally, are rendered meaningful, and are translated into the logic and language code of the system. For this very reason, system-theoretical policy borrowing researchers sharpen their lens for detecting reception and translation processes: which “irritation” or protracted policy conflict explains receptiveness for externalization, that is, lesson-drawing from other education systems, other function systems (notably, the economy), or from “the world” in the form of best practices or international standards broadly defined. Likewise, how are borrowed discourses translated into the code, language, and practices of the function system of education? Our preoccupation with the national level attempts to bring to light the performative act of systems. At particular moments, organizational systems may generate national boundaries and reassert themselves as national entities in order to make it appear as if there is (global) external pressure for reform or change. In the same vein, they construct other national systems as reference societies at particular moments to suggest that lessons should be drawn from these systems. The political act of externalization serves to unify or – to use a notion used in policy studies – to build coalitions in support of an educational policy. It is important to point out here that every political act of externalization necessitates, but also contributes to, the social construction of the nation as an acting subject.
Our attention is directed to how governments deal with policy contestation; at what moments they resort to the semantics of the “global,” PISA, TIMSS, or other global metrics; how they manage to frame them vis-à-vis their national constituents as an external (transnational) authority; and, finally, what impact their acts of externalization have on authorizing controversial educational policies in their country.


Finally, the concept of reference societies (Schriewer & Martinez, 2004) and counter-reference societies (Waldow, 2019) is ideally suited for explaining why politicians and policy makers are more enamored with PISA rather than with TIMSS, with PISA rather than with PISA-D, and with the league leader Finland rather than with Shanghai: TIMSS is ubiquitous, whereas PISA is associated with the attractive geopolitical space that OECD countries inhabit; PISA-D is for the poor and PISA for the middle- and upper-income economies; finally, for many non-Asian countries that hold stereotypical views of Asian education (i.e., “tiger mothers”), Shanghai is considered a counter-reference society, whereas Finland constitutes a reference society which, at least rhetorically, may be used for policy borrowing. In comparative education, the term “reference society” is key for an epistemological understanding of cross-national policy attraction, a subfield of policy borrowing research. At center stage is the question of which society or which national educational system is referenced, considered comparable, and therefore used as a model for lesson-drawing. This body of scholarship is closely associated with studies on “reference society” presented by sociologist Reinhard Bendix (1978, p. 292; see Waldow, 2019, p. 178). Bendix used the term to denote how governments used economic competitors and military rivals as reference societies for their own development. One of the examples, discussed by Bendix, is the fascination of Meiji era Japan with the West. The link between reference society and political alliances has also been well documented in comparative education research. Examples include cases of a radical change in reference societies as a result of fundamental political changes in PR China, the former Soviet Union, and Spain (Schriewer & Martinez, 2004), in post-Soviet Latvia (Silova, 2006), and in post-socialist Mongolia (Steiner-Khamsi & Stolpe, 2006).

The Appeal of International Large-Scale Student Assessments

There exist at least four reasons why ILSAs and, more narrowly, PISA are in such high demand among politicians and policy makers: (i) the comparative advantage of numbers over narratives, (ii) the quest for a credible source of evidence in an era marked by a surplus of evidence, (iii) the weak link between national curriculum and PISA, and (iv) transnational accreditation of public education.

First, Wendy Espeland (2015) and Radhika Gorur (2015) observe the advantages of numbers over complex narratives because one may attach one’s own narratives to numbers. What is especially appealing to policy actors are studies that produce statistics, scores, rankings, and benchmarks that are based on international comparison or on comparison over time. Espeland (2015, p. 56) explains the dual process of simplification and elaboration involved in the usage of numbers. In a first step, numbers tend to “erase narratives” by systematically removing the persons, institutions, or systems being evaluated by the indicator and the researcher doing the evaluation. This technology of simplification stimulates narratives, or as Espeland astutely observes:


If the main job of indicators is to classify, reduce, simplify, to make visible certain kinds of knowledge, indicators are also generative in ways we sometimes ignore: they evoke narratives, stories about what the indicators mean, what are their virtues or limitations, who should use them to what effect, their promises and their failings. (Espeland, 2015, p. 65)

Writing in 2003, that is, after the first wave of PISA shocks but before the exponential growth of global surveys, toolkits, databanks, and tests, Antonio Novoa and Talia Yariv-Mashal noted the politics of international comparison and examined how:

The fact that policy makers commonly interpret PISA results in line with their own policy agenda is best captured in the term “projections,” a term that Florian Waldow coined to capture the media accounts in Germany with regard to the “Finnish success” (Waldow, 2016). Media narratives of why Finnish students scored at the top on PISA had more to do with controversial policy issues in Germany than with the actual features of the Finnish educational system. Ironically, both proponents and opponents of German student-tracking policies, for example, referred to Finnish success in order to substantiate their beliefs. As Waldow poignantly contends:

In order to test the projection hypothesis empirically, Waldow and Steiner-Khamsi solicited contributions from noted scholars who examined the reception and translation of league leaders, slippers, and losers in their own national context (Steiner-Khamsi & Waldow, 2018; Waldow & Steiner-Khamsi, 2019). The focus on the idiosyncrasies of a system and its national forms of organization brings a fascinating phenomenon to light that at first sight appears to be contradictory: despite the widespread rhetoric of learning from “best performing” school systems, there is no universal consensus on why some school systems do better than others on tests such as PISA. On the contrary, there is great variation in how national governments, media, and research institutions explain Finland’s, Shanghai’s, or Singapore’s “success” in PISA or TIMSS. “Finnish success” is a good case in point. There is a long list of explanations for why Finnish students do well on ILSAs.


Depending on what the controversial policy issue is for which policy actors seek an (internally produced) quasi-external authoritative answer, Finland’s success is attributed to its strong university-based teacher education system, the system of comprehensive schooling with minimal tracking of students, or the nurturing environment in schools where students, ironically, are exposed to very few high-stakes standardized tests. The same applies to the league leaders themselves: depending on the timing, notably whether the positive results are released at the end or the beginning of a reform cycle, the policy actors tend to take credit for the positive results or, on the contrary, belittle the success and proclaim that the students performed well for all the wrong reasons, including private tutoring, stressful school environment, and learning to the test (see Waldow & Steiner-Khamsi, 2019).

Second, the boom in evidence-based policy planning has generated a surplus of evidence, to the extent that there is now the challenge of how to weed out irrelevant evidence based on comparability and credibility criteria. Concretely, in the wake of complexity reduction, we are beginning to witness today a hierarchization of information (very often with randomized controlled trials at the top and qualitative data at the bottom), rendering some types of evidence more relevant than others. At the same time, the disclosure of the source of information to make a case for the credibility of the evidence, that is, the reference, has become as important as, if not more important than, the information itself. In fact, the legitimacy of the assertion rests in great part on the source of information itself. For example, a reference here and there to OECD studies has become a sine qua non for policy analysts in Europe, because OECD is seen, in the Foucauldian sense, as a founder of discursiveness for a very special kind of policy knowledge that ranks top in the hierarchy of evidence, one that operates with numbers and draws on international comparison to enforce a political program of accountability. Christian Ydesen and his associates (2019) have convincingly documented the rise of OECD as a global education governing complex that uses a range of policy instruments (PISA, Education at a Glance, country reports, etc.) to diagnose and monitor national developments and advance global solutions of a particular kind for national reforms. As mentioned earlier, PISA is, more so than TIMSS and PIRLS, associated with the attractive geopolitical “educational space” of affluent OECD countries (Novoa & Lawn, 2002). Similarly, the propensity to reference OECD to authorize national policy decisions has been clearly visible in the five-country study “Policy Knowledge and Lesson Drawing in an Era of International Comparison” (POLNET), funded by the Norwegian Research Council and coordinated at the University of Oslo. The bibliometric network analyses of key policy documents reveal that OECD publications rank top in all five Nordic countries (Denmark, Finland, Iceland, Norway, Sweden) as compared to other international texts published outside the Nordic region (see Karseth et al., 2021).

Third, the difficulty of drawing tangible conclusions from PISA results for curriculum reform may actually be considered a strength, rather than a weakness, of the test. With the exception of the league leaders, PISA has become a stamp of attestation that national educational systems do not teach students twenty-first-century skills. However, what they would need to do in terms of the curriculum in order to do so remains unclear (see Labaree, 2014).


For a long time, PISA’s focus on global competencies was accompanied by a complete disregard for national curricula and therefore enabled policy makers to fill the vacuum by providing their own explanations of why their system is, or is not, preparing the students for twenty-first-century skills (Recently, OECD has addressed the loose coupling between PISA and national curricula in its Curriculum Redesign initiative, developed as part of its Future of Education and Skills 2030 project. See, in particular, the OECD overall report: https://www.oecd.org/education/2030project/curriculum-analysis/ and an example of a country report, i.e., Wales: http://www.oecd.org/publications/achieving-the-new-curriculum-for-wales4b483953-en.htm). PISA is often seen as an empty vessel that policy actors may fill with meaning to leverage reform pressure and mobilize resources. To demystify the promotional, commonsensical rationale of lesson-drawing, Addey, Sellar, Steiner-Khamsi, Lingard, and Verger (2017) examine the wide range of reasons why governments participate in ILSAs. The seven most common reasons for governments’ engagement with ILSAs are (1) evidence for policy; (2) technical capacity building; (3) funding and aid; (4) international relations; (5) national politics; (6) economic rationales; and (7) curriculum and pedagogy. The authors contend that only one of the seven main reasons is directly related to curriculum reform and pedagogy.

Finally, PISA in particular functions symbolically very much like a transnational accreditation of national (public) education. In stark contrast to technical-vocational education, as well as to private providers (International Baccalaureate Organisation, Cambridge Assessment International Education, etc.), there exists no transnational accreditation for the public education system. The situation is different in technical-vocational education. As shown by Eva B. Hartmann (2016), “endogenous privatization” is much more advanced in technical-vocational education than in the public school system. There, certification is decentralized, fragmented, and diversified, propelling the rise of transnational certifiers that no longer require a government, a legislative body, or a profession for accreditation or legitimization (Hartmann, 2016). In the public school system, however, the transnational certifiers are either private providers or, symbolically, ILSAs. In a study of international standard schools, Steiner-Khamsi and Dugonjic (2018) noticed the exponential growth of public-private partnerships in which governments fund privately run international schools with the aim to enrich the regular curriculum, to improve the quality of education in public schools, to introduce English as a language of instruction, or simply to internationalize public schools. By extension, public means “national” and private “international.” A project of the modern nation-state, compulsory education is national in terms of accreditation, teaching content, and language of instruction. In contrast, private providers are able to orient themselves and operate both at a national and an international scale. Without doubt, the ubiquitous talk of global markets and the attractiveness of international student mobility have helped boost the attractiveness of transnational accreditation. If the trend continues, “international” is likely to become increasingly positively associated with cosmopolitanism and “national” with backwardness and parochialism. In an era of globalization, the national orientation has become in and of itself a burden to governments and an object of intense critical
scrutiny. The ability of PISA, TIMSS, and other ILSAs to symbolically provide an international stamp of approval or quasi-certification for (national) public school systems goes a long way in today’s attack on public education worldwide.

Creating a Demand for ILSAs

It is important to bear in mind that the growth of ILSAs is a development of only recent decades; in fact, many countries have traditionally monitored educational progress through national learning assessments, especially at the basic education level and targeting mathematics and literacy achievement (Kamens & McNeely, 2010). Early attempts to use standardized methods of measurement and to systematically compare and exchange data on student performance between the United States and European countries came to an abrupt end with World War II (Lawn, 2008). Yet, around the world today, more countries are having their students participate in ILSAs than ever before. Most recently, more than 600,000 15-year-olds in more than 75 economies took part in PISA 2018, while over 60 countries/territories participated in TIMSS 2019. This spectacular growth in the ever-expanding radius of ILSAs can be observed concurrently in terms of participating systems, age groups, and target uses, and it is attributable, in part, to supply-side practices that induce ILSA participation. Take PISA, for instance: the OECD has been actively creating new “gateway” trial participation modalities, first by allowing non-OECD countries (partner countries) to enter and then by admitting select “economies” or municipalities (e.g., Baku-Azerbaijan, Miranda-Venezuela, Shanghai-China) to participate on behalf of entire countries. In the case of Shanghai-China in particular, Liu (2019) finds that the OECD’s gateway participation strategy, by differentiating “economies” or municipalities from entire countries, effectively minimized the risks of underperformance that could otherwise have prevented or delayed new participants’ entry. Such induced participation has had considerable impact on the organization and operation of schooling in many countries; as Verger, Fontdevila, and Parcerisa (2019) have shown, countries are increasingly pressed by the global spread of ILSAs and test-based accountability reforms to test students more frequently in schools.

Once countries commence ILSA participation, they also embark on a process of internalizing the rhetoric of ILSA logic, that is, the preoccupation with attempting to commensurate different aspects of learning into measurable, calculable, and rankable domains (Gorur, 2016). Under a concerted discourse of assessment, accountability, and competitiveness, countries come to believe that ILSAs and testing are essential for evidence-based policy-making in education (Carnoy & Rothstein, 2013; Sellar & Lingard, 2013). Consequently, an opportune moment is created for global actors to sell their tests for an ever-increasing number of subjects, age levels, and educational systems. To illustrate this prevalence, PISA-D (PISA for Development), PISA-S (PISA for Schools), and “PISA for 5-year-olds” (the International Early Learning and Child Well-being Study) are now administered every 3 years worldwide.

However, contrary to the claims made for ILSAs, the emergence of these new products is in no way “demand-led,” nor is it a response to the needs of the international community; instead, it echoes the desire to establish a global testing culture. Cambodia, for example, had no interest in participating in PISA-D but was persuaded to do so only as a result of the rebranding of assessment as a human right, particularly by merging the concepts of access and learning and blurring the lines between human development and human capital (Auld et al., 2019). In addition, through subnational PISA-S participation, schools are not only benchmarked in reading, mathematics, and science against international schooling systems but are also sold, wholesale, best-practice examples of what they are expected to do (Lewis, 2017). As Lewis (2018) theorizes, this “present-ing of the future” approach effectively channels governments’ burgeoning anxiety by constantly looking backward, hyperpositivistically decontextualizing system differences, and rendering such differences equivalent to a malleable, unfulfilled future potential that can be addressed with a prescribed list of policy interventions. Moreover, “PISA for 5-year-olds” represents the OECD’s most recent strategy to determine a “best” form of early childhood education, advocate for a singular and universal truth that is offered as technically objective, and practically reshape what it means to be a child while denying alternative values, perspectives, and approaches (Delaune, 2019). Subsequently, this proliferation of international legitimation and normative emulation ensures governments’ sustained interest in participating in these tests.

More critically, the recent surge in ILSA participation and the emergence of ILSA-related products are not only due to the niche marketing of national competitiveness and skills competency but are also clear reflections of strategic narratives that the OECD and the global testing industry have created to ensure continued engagement. To this end, two brand-new developments in ILSAs’ global expansion that require further examination are (i) the preoccupation with linking test accountability to an “education is in crisis” narrative and (ii) new ILSA-derivative tools that urge countries to reconsider partial and non-participation in ILSAs.

On the one hand, there is a growing test accountability rhetoric, stemming from the World Bank’s (2017) “global learning crisis,” that advocates anchoring ILSA results to a global education monitoring metric. In particular, the World Bank has become a front-runner in this arena by compiling a new Human Capital Index (HCI), which is derived by applying sophisticated conversion formulas to ILSA results. The World Bank argues that this new ILSA derivative will act as a macro-education performance tracker that keeps governments in check and helps mobilize political incentives to address the “global learning crisis” (World Bank, 2018a). This ILSA-derivative “innovation” has several far-reaching and problematic implications: (a) the prominence of the HCI represents a significant shift in what educational development means, and (b) the harmonization process embedded in the HCI creates a new monopoly of test accountability. Historically, while the rise of ILSAs represents a shift of focus from access to learning, the World Bank’s new ILSA derivative, namely the HCI, attempts to draw new causal links between student learning and worker productivity. As the World Bank (2018b, p. 59) affirms, the HCI aims to quantify “the productivity of the next generation of workers.”

By contrast, existing education development movements, such as Education for All (EFA) and Sustainable Development Goal 4 (SDG4), and established development indicators, such as the Human Development Index (HDI), had set education access, equity, and learning – not worker productivity – as their focal point. As Verger (2012) observes, policy entrepreneurs often “innovate” by transposing and translating concepts. In this regard, the emergence of the HCI changes what educational development means, alters what purpose it serves, and sells wholesale the coupling of learning and productivity to a long list of countries. For instance, based on HCI calculations, the World Bank’s (2019a, p. 25) Regional Human Capital Plan for Africa aims to “increase productivity for children born today by 13 percent,” all to be achieved before 2023. Quite literally, the World Bank’s approach transforms the educational aspect of human development, shifting its meaning from “access” to “learning” and now to “worker productivity,” and this emerging trend is increasingly adopted and evident in many influential country briefs and reports, including Pakistan@100: Shaping the Future and Future Drivers of Growth in Rwanda.

More importantly, the World Bank intends to use its HCI strategically to promote new measurement initiatives, assess and reevaluate lending programs, and realign political and economic incentives (World Bank, 2018a, p. 10). In its call to action to end the global learning crisis, the World Bank (2019b) estimated country-specific learning gaps using ILSAs such as PISA, TIMSS, and the Progress in International Reading Literacy Study (PIRLS) and emphasized the importance of measuring student learning outcomes as a first step toward pushing for educational change. The World Bank’s (2019b, p. 13) underlying message is clear: “a highly fragmented learning assessment system with significant variation across regions” is undesirable and impedes progress toward eliminating learning poverty; therefore, countries must work to harmonize assessment “coverage, comparability, and frequency.” By and large, this is in effect creating a new monopoly of test accountability. By June 2019, according to the World Bank (2019c, p. 10), 63 countries had endorsed the Bank’s human capital approach; among them, Uzbekistan and Mongolia had pledged to participate in future waves of PISA, the Philippines had agreed to restart its participation in TIMSS, and Sub-Saharan African countries such as the Central African Republic, Cote d’Ivoire, Guinea, Niger, and Nigeria had committed to commencing early-grade testing of reading and mathematics.

On the other hand, the HCI’s key component – Harmonized Learning Outcomes (HLO) – is based on a flawed ILSA extrapolation methodology that, in effect, penalizes governments that have chosen alternative, non-standardized paths for measuring learning. As a brief overview of the HCI calculations: the World Bank synchronizes ILSA results by producing a “learning exchange rate” based on test overlap systems (World Bank, 2018b, p. 8) and extrapolates countries’ (presumed) test performance onto an HLO scale ranging from 300 to 625 points. Then, taking a given country’s HLO as a percentage of a Bank-chosen “full learning” benchmark of 625 TIMSS-equivalent points, learning-adjusted years of schooling (LAYS) are calculated to adjust for the amount of learning that occurs during the expected years of schooling in that country.
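
As a minimal illustration of the LAYS arithmetic just described – our reading of the Bank’s published description, not its official code – the sketch below reproduces the Singapore and Ghana figures reported in Table 1; the function and variable names are ours.

```python
# A minimal sketch of the LAYS calculation described above (an illustration of
# the Bank's published description, not its official code). HLO and
# expected-years figures are the Singapore and Ghana rows of Table 1.

FULL_LEARNING_HLO = 625  # Bank-chosen "full learning" benchmark, in TIMSS-equivalent points

def learning_adjusted_years(hlo: float, expected_years: float) -> float:
    """Scale expected years of schooling by the share of 'full learning' achieved."""
    return expected_years * (hlo / FULL_LEARNING_HLO)

for country, hlo, expected_years in [("Singapore", 580.9, 13.9), ("Ghana", 307.3, 11.6)]:
    lays = learning_adjusted_years(hlo, expected_years)
    print(f"{country}: {hlo / FULL_LEARNING_HLO:.0%} of the benchmark -> "
          f"{lays:.1f} LAYS out of {expected_years} expected years")
```

The point of the sketch is simply that the “productivity” adjustment amounts to a single proportional rescaling against one anchor value of 625 points.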

Table 1 Human Capital Index and its education component, select countries and economies

Countries/economies | Human Capital Index | Harmonized learning outcomes | Learning-adjusted years of schooling (1) | Expected years of schooling (2) | Difference: (2) – (1)
Singapore | 0.88 | 580.9 | 12.9 | 13.9 | 1.0
South Korea | 0.84 | 563.1 | 12.2 | 13.6 | 1.3
Japan | 0.84 | 563.4 | 12.3 | 13.6 | 1.3
... | ... | ... | ... | ... | ...
Kosovo | 0.56 | 374.8 | 7.7 | 12.8 | 5.1
Guyana | 0.49 | 346.4 | 6.7 | 12.1 | 5.4
Ghana | 0.44 | 307.3 | 5.7 | 11.6 | 5.9

Adapted from Liu and Steiner-Khamsi (2020)

According to the World Bank (2018a), this computed result should reflect the “productivity” of education systems relative to the benchmark case. In Table 1, we present descriptive information for a few of the education systems with the smallest and the largest gaps between LAYS and expected years of schooling, as indicated in the final column of Table 1. As an example, in 2015, Singapore’s HLO score was 581 points, which signifies that approximately 93 percent of all potential learning was realized, relative to the 625-point benchmark. In other words, while Singaporean students are expected to attend 13.9 years of school, the World Bank (2018a) calculations indicate that only 12.9 of those years constituted effective learning. By contrast, students in Ghana are estimated to realize only 5.7 years of their potential learning when the schooling expectation is 11.6 years; in other words, more than half of their time spent at school is considered unproductive by this new metric.

The central assumption in this extrapolation is that ILSAs are inherently comparable and, more specifically, that all variation in score distributions on each test “can be attributed entirely to the assessments” (World Bank, 2018c, p. 10). Of course, this claim is highly implausible, and Liu and Steiner-Khamsi (2020) summarize at least four reasons why ILSAs are systematically incomparable: (a) differences by design feature, (b) differences by sampling approach, (c) differences by nonschool influence, and (d) differences in measurement tools.

First, ILSAs and regional assessments are run by different testing agencies and vary in objective, domain, and design (Lietz et al., 2017). As a case in point, there exist two distinct classes of test programs: one that primarily focuses on evaluating skills, literacy, and competencies (e.g., PISA, PIRLS) and another that assesses learning outcomes more closely aligned with curriculum input (e.g., TIMSS, SACMEQ). When differences in test objective, domain, and design are compounded, they introduce substantial uncertainty into any inferences.

Second, participating countries, target populations, and sampling approaches vary drastically by test program (Rutkowski & Rutkowski, 2018). For instance, PISA primarily includes a large cluster of the world’s most prominent economies (OECD members and invited OECD partner systems), whereas TIMSS and PIRLS participants include many developing countries, and regional tests such as LLECE, SACMEQ, and EGRA are geographically exclusive.

Third, the salience of school quality and nonschool factors varies considerably across systems (Chudgar & Luschei, 2009; Liu, 2016). A large set of studies indicates that ILSA score differences result from a combination of underlying family-, cultural-, and measurement-related factors that are arguably independent of school quality. For instance, child development conditions at home vary substantially across systems, especially family educational expectations and private education spending.

Finally, there are considerable differences in measurement technology and technical standards among test programs (Jerrim, 2016). For example, substantial differences in results exist between “paper-and-pencil” and computer-administered tests. In PISA 2018, nearly all students took computer-administered tests, while TIMSS only began a gradual transition to computer-based testing in 2019.

On the empirical end, Liu and Steiner-Khamsi (2020) examine the validity of the World Bank’s ILSA harmonization exercise and show how it is entrenched in a siloed methodology. For one, test overlap systems are shown to be drastically different from systems that are more selective in test participation (see Table 2). For instance, Botswana, the only test overlap country in SACMEQ 2013 and TIMSS 2011, is twice as wealthy as the “SACMEQ Only” systems and provides at least one additional year of formal (pre-primary and primary) schooling compared with both the “SACMEQ Only” and “TIMSS Only” systems.

Table 2 Comparing test overlap and non-overlap country profiles, select economies

SACMEQ 2013 and TIMSS 2011 (fourth grade) linking
Group | List of economies | Average GDP per capita (PPP 2015) | Average formal schooling (years): pre-primary and primary
Test overlap | Botswana | 15,357 | 10
SACMEQ 2013 Only | Kenya, Lesotho, Mauritius, Malawi, Namibia, Seychelles, Uganda, Zambia, Zimbabwe | 7,530 | 9.05
TIMSS 2011 Only | Azerbaijan, Bahrain, Botswana, Canada, Chile, Chinese Taipei, Cyprus, Egypt, Georgia, Hong Kong, Hungary, Iran, Ireland, Israel, Italy, Japan, Jordan, Kazakhstan, Kuwait, Lebanon, Lithuania, Malaysia, Malta, Morocco, New Zealand, Norway, Oman, Qatar, Russian Federation, Saudi Arabia, Serbia, Singapore, Slovenia, Sweden, South Africa, South Korea, Thailand, Turkey, USA, United Arab Emirates | 21,411 | 8.45

Adapted from Liu and Steiner-Khamsi (2020)

For another, the findings indicate that even within the same system and the same test year, different ILSAs choose drastically different test samples (see Table 3). As a case in point, target population coverage can differ by as much as 42.3 and 32.7 percentage points in Buenos Aires-Argentina and Lebanon, respectively (Table 3, Panel A), while sample average age can differ by at least 1 year, and up to 2.2 years, in the same country and the same test year across different test programs (Table 3, Panel B).

More worryingly, new penalties for partial and non-participation in ILSAs are generated as inevitable by-products of the HCI’s flawed methodology. Liu and Steiner-Khamsi (2020) regress country HLO scores on test participation type to discern the relationship between ILSA participation and HLO scores. Importantly, test participation type alone accounts for about 58 percent of the variation in HLO scores. It is especially troubling that the majority of these partial and non-participants are low- and lower-middle-income countries. After controlling for test year and region fixed effects, which account for year- and region-varying observable and unobservable factors, as well as for country-level per capita income, the score penalties associated with partial ILSA participation continue to hold and equate to at least one full year of learning. This will inevitably realign countries’ incentives regarding how student learning outcomes are measured in the medium to long term. In an era of global monitoring of education development, these “methodological glitches” have far-reaching political, social, and economic consequences.

Importantly, it should be acknowledged that there is no direct relationship between testing learning outcomes and improving the quality of education in schools. According to the World Bank (2019b), “whole of government” changes at the policy and practice levels are needed to create meaningful impact on learning. Yet, in reality, the introduction of test-based accountability reforms globally has led to a normative testing culture worldwide (Verger et al., 2019). The World Bank’s HCI – and, by implication, ILSAs – relies on and in effect exacerbates test-based accountability. It needs to be considered the new “soft power” governance tool par excellence, because it penalizes systems that choose not to participate in ILSAs. This outcomes- and output-based reform movement is worrying not only because it burdens students, teachers, and schools physically and mentally with tests and rankings but also because it is the most visible signpost of the market model in education. Such a reduced worldview of education is part and parcel of neoliberal reforms and encourages businesses to enter the education sector, first as providers of goods and services (Robertson & Verger, 2012; Steiner-Khamsi & Draxler, 2018) and more recently also as policy advisors (Lubienski, 2019). The reliance on outcomes measurement, coupled with the globalization of the knowledge-based economy, explains the sharp rise of interest in “governing by numbers” (Grek, 2008), which in itself implies a coercive aspect in the current state of global monitoring, evaluation, and international comparison using ILSAs.
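
To make the kind of comparison described above concrete, the following is a minimal sketch, assuming a country-level data set with hypothetical column names (hlo, participation, test_year, region, gdp_pc). It is one plausible reading of a fixed-effects specification, not Liu and Steiner-Khamsi’s (2020) actual model or data; the rows are placeholders so that the snippet runs.

```python
# Illustrative sketch only: OLS of HLO scores on ILSA participation type with
# test-year and region fixed effects and log per-capita income as a control.
# The data frame holds made-up placeholder rows, not real HLO data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "hlo":           [580, 560, 420, 390, 360, 340, 500, 470, 450, 410],
    "participation": ["full", "full", "partial", "partial", "none",
                      "none", "full", "partial", "full", "none"],
    "test_year":     [2015, 2017, 2015, 2017, 2015, 2017, 2015, 2017, 2015, 2017],
    "region":        ["EAP", "EAP", "SSA", "SSA", "SSA", "SSA", "ECA", "ECA", "EAP", "ECA"],
    "gdp_pc":        [85000, 38000, 6000, 4500, 2000, 1800, 30000, 9000, 12000, 7000],
})

model = smf.ols(
    "hlo ~ C(participation) + C(test_year) + C(region) + np.log(gdp_pc)",
    data=df,
).fit()

print(model.rsquared)  # share of HLO variation explained by the specification
print(model.params)    # coefficients on the participation dummies are the estimated score gaps
```

In such a specification, the coefficients on the participation dummies capture the conditional score gap that the chapter refers to as a penalty for partial or non-participation.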

Table 3 Comparing PISA 2015 and TIMSS 2015 (eighth grade) samples, select economies

Panel A | Target population coverage (%)
Economies | PISA 2015 (1) | TIMSS 2015 (2) | Difference |(1)–(2)|
Buenos Aires-Argentina | 55 | 97.3 | 42.3
Lebanon | 66 | 98.7 | 32.7
Thailand | 71 | 99.8 | 28.8
... | ... | ... | ...
Malta | 98 | 96.5 | 1.5
Russian Federation | 95 | 96.3 | 1.3
Sweden | 94 | 94.5 | 0.5

Panel B | Sample mean age (years)
Economies | PISA 2015 (3) | TIMSS 2015 (4) | Difference |(3)–(4)|
Georgia | 15.9 | 13.7 | 2.2
Jordan | 15.9 | 13.8 | 2.1
Norway | 15.8 | 13.7 | 2.1
... | ... | ... | ...
Lithuania | 15.8 | 14.7 | 1.1
Russian Federation | 15.8 | 14.7 | 1.1
Sweden | 15.7 | 14.7 | 1.0

Adapted from Liu and Steiner-Khamsi (2020)

Conclusion

Strikingly, studies on the reasons for participation in ILSAs and on the impact of ILSA reception have generated a discursive shift in globalization studies, bringing the national dimension to the foreground. The preoccupation with what league leaders (Finland, Shanghai, Singapore, etc.) have “done right” has generated new momentum for policy borrowing research. Precisely at a moment in policy borrowing research when scholars had put the study of cross-national policy attraction to rest and instead directed their attention to the ubiquitous diffusion processes of global education policies in the form of “best practices” or “international standards,” the cross-national dimension – and by implication the focus on the nation-state and its national policy actors – has regained importance in ILSA policy research. In the case of PISA, the preoccupation of national policy actors is, at least rhetorically, with how their own system scores compared to others and what there is to “learn” from the league winners, league slippers, and league losers in terms of PISA’s twenty-first-century skills. Because policy actors often attribute “best practices” to particular national educational systems, the national level has regained importance as a unit of analysis. In a similar vein, national politicians and education authorities are held accountable for test results regardless of how decentralized or “polycentric” the governance of the sector is (Cairney, 2016). We are thus in the awkward position of having to bring back the focus on national systems – a unit of analysis that has been criticized as “methodological nationalism” and that, if used naively, is cause for concern because of its homogenizing effects (see Wimmer & Schiller, 2003).

However, the great return of the national in an era of globalization is only at first sight counterintuitive. Without doubt, the role of the state in providing public education has diminished (see Ball, 2018). Yet such an assertion only partially captures the impact of the neoliberal reforms of the past three decades on the authority of the state. It is more accurate to contend that the role of the state has diminished and changed. In the education sector, the change implied a new role for the state, new ways of regulating the education system, and new tools for generating or alleviating reform pressure. The neoliberal reforms of the 1980s and 1990s were undertaken with the rhetoric of breaking the “state monopoly,” using “market forces” instead to improve the quality of public education, and cutting inefficiency in the “state bureaucracy.” Regardless of whether the public education system was high or low performing, governments were under political pressure to selectively borrow new public management policies that encouraged non-state actors such as businesses, churches, communities, and families to open and operate schools with funding from public resources. Within a short period of time, governments scaled back the role of the state in education from being simultaneously provider and regulator to being only a regulator, by way of standard setting and periodic monitoring of learning outcomes. Standard setting, monitoring – including self-monitoring – and benchmarking became the key governance tools.

In education, the outcomes orientation of the new public management reform triggered a proliferation of standardized student assessments. The tests have, for a variety of reasons, been utilized as the primary monitoring tool for governments to assess the quality of teachers, schools, districts, and the education system and to make policy decisions based on these standardized assessments. The shift from government to governance has not only fueled governance by numbers but also required governments to engage in “network governance” (Ball & Junemann, 2012), in which non-state actors, including education businesses, are seen not only as providers of goods and services but also as key partners in the policy process. The empowerment of non-state actors in the new millennium, including the private sector as well as transnational regimes (such as ILSAs and the HCI), as key policy actors has been interpreted as a clear sign of the “disarticulation and diversification of the state system” (Ball & Junemann, 2012, p. 24) that the neoliberal reforms of the past century intended to achieve. Without doubt, this broader policy context matters for understanding the exponential boom of ILSAs. It is a context in which the state governs by means of outcomes measurement and data circulation (Piattoeva et al., 2018).

This chapter attempted to show that, from the perspectives of politicians and policy makers, there are several benefits to participating in ILSAs. OECD’s PISA, in particular, enables decision-makers to justify their political decisions by (i) referring to and interpreting numbers in line with their reform agendas, (ii) drawing on a credible source (OECD) in an era marked by a surplus of evidence, (iii) keeping the national sanctuary – the national curriculum framework – intact while subscribing to a competency-based curriculum reform, and finally (iv) using participation in PISA as a symbol of transnational accreditation in an educational space inhabited by middle- and upper-income countries. At the same time, international organizations have put mechanisms in place that ensure sustained interest in participating in ILSAs. Among these strategic measures, discussed in this chapter, are (i) the preoccupation with linking test accountability to an “education is in crisis” narrative and (ii) the development of new ILSA-derivative tools that urge countries to reconsider partial and non-participation in ILSAs. Of course, in this process, coercive measures are not excluded. Finally, in the age of ILSA expansion, caution is required: while there is a surplus of assessment, the predominant policy advice remains singular – that countries must standardize the measurement of learning and intervene in education in order to improve the productivity of their future workforce.

References

Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017). Forum discussion: The rise of international large-scale assessments and rationales for participation. Compare, 47(3), 434–452. https://doi.org/10.1080/03057925.2017.1301399
Auld, E., Rappleye, J., & Morris, P. (2019). PISA for Development: How the OECD and World Bank shaped education governance post-2015. Comparative Education, 55(2), 197–219.
Ball, S. J. (2018). The tragedy of state education in England: Reluctance, compromise and muddle – A system in disarray. Journal of the British Academy, 6, 207–238. https://doi.org/10.5871/jba/006.207
Ball, S. J., & Junemann, C. (2012). Networks, new governance and education. University of Bristol and Policy Press.
Bendix, R. (1978). Kings or people: Power and the mandate to rule. University of California Press.
Cairney, P. (2016). The politics of evidence-based policy making. Palgrave.
Carnoy, M., & Rothstein, R. (2013). What do international test scores really show about U.S. student performance? Economic Policy Institute.
Chudgar, A., & Luschei, T. F. (2009). National income, income inequality, and the importance of schools: A hierarchical cross-national comparison. American Educational Research Journal, 46(3), 626–658.
Delaune, A. (2019). Neoliberalism, neoconservativism, and globalisation: The OECD and new images of what is ‘best’ in early childhood education. Policy Futures in Education, 17(1), 59–70.
Espeland, W. (2015). Narrating numbers. In R. Rottenburg, S. E. Merry, S.-J. Park, & J. Mugler (Eds.), The world of indicators. The making of governmental knowledge through quantification (pp. 56–75). Cambridge University Press.
Gorur, R. (2015). Producing calculable worlds: Education at a glance. Discourse: Studies in the Cultural Politics of Education, 36(4), 578–595.
Gorur, R. (2016). Seeing like PISA: A cautionary tale about the performativity of international assessments. European Educational Research Journal, 15(5), 598–616.
Grek, S. (2008). From symbols to numbers: The shifting technologies of education governance in Europe. European Educational Research Journal, 7(2), 208–218.
Hartmann, E. B. (2016). Education outside the public limelight: The ‘parallel universe’ of ICT certifiers. In A. Verger, C. Lubienski, & G. Steiner-Khamsi (Eds.), World yearbook of education 2016, the global education industry (pp. 228–247). Routledge.
Jerrim, J. (2016). PISA 2012: How do results for the paper and computer tests compare? Assessment in Education: Principles, Policy & Practice, 23(4), 495–518.
Kamens, D. H., & McNeely, C. L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative Education Review, 54(1), 5–25.
Karseth, B., Sivesind, K., & Steiner-Khamsi, G. (Eds.). (2021, forthcoming). Evidence and expertise in Nordic education policies: A comparative network analysis from the Nordic region.
Labaree, D. F. (2014). Let’s measure what no one teaches: PISA, NCLB, and the shrinking aims of education. Teachers College Record, 116(9), 1–14.
Lawn, M. (2008). An Atlantic crossing? The work of the international examination inquiry, its researchers, methods and influence. Symposium.
Lewis, S. (2017). Governing schooling through ‘what works’: The OECD’s PISA for Schools. Journal of Education Policy, 32(3), 281–302.
Lewis, S. (2018). PISA ‘Yet To Come’: Governing schooling through time, difference and potential. British Journal of Sociology of Education, 39(5), 683–697.
Lietz, P., Cresswell, J. C., Rust, K. F., & Adams, R. J. (Eds.). (2017). Implementation of large-scale education assessments. John Wiley & Sons.
Liu, J. (2016). Student achievement and PISA rankings: Policy effects or cultural explanations? In W. Smith (Ed.), The global testing culture: Shaping education policy, perceptions, and practice (Oxford studies in comparative education) (Vol. 25, pp. 85–99). Symposium Books.
Liu, J. (2019). Government, media, and citizens: Understanding engagement with PISA in China (2009–2015). Oxford Review of Education, 45(3), 315–332.
Liu, J., & Steiner-Khamsi, G. (2020). Human Capital Index and the hidden penalty for non-participation in ILSAs. International Journal of Educational Development, 73, 102149.
Lubienski, C. (2019). Advocacy networks and market models for education. In M. P. do Amaral, G. Steiner-Khamsi, & C. Thompson (Eds.), Researching the global education industry (pp. 69–86). Palgrave.
Luhmann, N. (1990). Essays on self-reference. Columbia University Press.
Nóvoa, A., & Lawn, M. (2002). Fabricating Europe. Springer Netherlands.
Novoa, A., & Yariv-Mashal, T. (2003). Comparative research in education: A mode of governance or a historical journey? Comparative Education, 39(4), 423–438.
Piattoeva, N., Gorodski Centeno, V., Suominen, O., & Rinne, R. (2018). Governance by data circulation? The production, availability, and use of national large-scale assessment data. In J. Kauko, R. Rinne, & T. Takala (Eds.), Politics of quality in education. A comparative study of Brazil, China, and Russia (pp. 115–135). Palgrave.
Robertson, S. L., & Verger, A. (2012). Governing education through public private partnership. In S. L. Robertson, K. Mundy, A. Verger, & F. Menashy (Eds.), Public private partnership in education. New actors and modes of governance in a globalizing world (pp. 21–42). Edward Elgar.
Rutkowski, L., & Rutkowski, D. (2018). Improving the comparability and local usefulness of international assessments: A look back and a way forward. Scandinavian Journal of Educational Research, 62(3), 354–367.
Schriewer, J. (1990). The method of comparison and the need for externalization: Methodological criteria and sociological concepts. In J. Schriewer & B. Holmes (Eds.), Theories and methods in comparative education (pp. 25–83). Lang.
Schriewer, J., & Martinez, C. (2004). Constructions of internationality in education. In G. Steiner-Khamsi (Ed.), The global politics of educational borrowing and lending (pp. 29–52). Teachers College Press.
Sellar, S., & Lingard, B. (2013). The OECD and global governance in education. Journal of Education Policy, 28(5), 710–725.
Silova, I. (2006). From sites of occupation to symbols of multiculturalism. Reconceptualizing minority education in post-Soviet Latvia. Information Age Publisher.
Steiner-Khamsi, G. (2021). Externalisation and structural coupling: Applications in comparative policy studies in education. European Educational Research Journal. https://doi.org/10.1177/1474904120988394 (forthcoming).
Steiner-Khamsi, G., & Draxler, A. (2018). Introduction. In G. Steiner-Khamsi & A. Draxler (Eds.), The state, business, and education: Public-private partnerships revisited (pp. 1–15). E. Elgar.
Steiner-Khamsi, G., & Stolpe, I. (2006). Educational import in Mongolia: Local encounters with global forces. Palgrave Macmillan.
Steiner-Khamsi, G., & Dugonjic-Rodwin, L. (2018). Transnational accreditation for public schools: IB, PISA and other public-private partnerships. Journal of Curriculum Studies, 50(5). https://doi.org/10.1080/00220272.2018.1502813
Steiner-Khamsi, G., & Waldow, F. (Eds.). (2018). PISA for scandalisation, PISA for projection: The use of international large-scale assessments in education policy making. Half special issue of Globalisation, Societies and Education, 16(5), 557–565.
Verger, A. (2012). Framing and selling global education policy: The promotion of public-private partnerships for education in low-income contexts. Journal of Education Policy, 27(1), 109–130.
Verger, A., Fontdevila, C., & Parcerisa, L. (2019). Reforming governance through policy instruments: How and to what extent standards, tests and accountability in education spread worldwide. Discourse: Studies in the Cultural Politics of Education, 40(2), 248–270.
Waldow, F. (2016). Das Ausland als Gegenargument: Fünf Thesen zur Bedeutung nationaler Stereotype und negativer Referenzgesellschaften [Foreign countries as a counterargument: Five propositions regarding the significance of national stereotypes and negative reference societies]. Zeitschrift für Pädagogik, 62(3), 403–421.
Waldow, F. (2019). Introduction: Projection in education policy-making. In F. Waldow & G. Steiner-Khamsi (Eds.), Understanding PISA’s attractiveness: Critical analyses in comparative policy studies. Bloomsbury.
Waldow, F., & Steiner-Khamsi, G. (Eds.). (2019). Understanding PISA’s attractiveness: Critical analyses in comparative policy studies. Bloomsbury.
Wimmer, A., & Schiller, N. G. (2003). Methodological nationalism, the social sciences, and the study of migration: An essay in historical epistemology. International Migration Review, 37(3), 576–610.
World Bank. (2017). World development report 2018: Learning to realize education’s promise. World Bank.
World Bank. (2018a). The human capital project. World Bank.
World Bank. (2018b). Measuring human capital. World Bank.
World Bank. (2018c). Global dataset on education quality: A review and update (2000–2017). World Bank.
World Bank. (2019a). Regional human capital plan for Africa. World Bank.
World Bank. (2019b). Human capital project: First year annual progress report. World Bank.
World Bank. (2019c). Ending learning poverty: What will it take? World Bank.
Ydesen, C. (2019). The formation and workings of a global education governing complex. In C. Ydesen (Ed.), The OECD’s historical rise in education, the formation of a global governing complex (pp. 291–304). Palgrave.

5 Educational Accountability and the Role of International Large-Scale Assessments

Susanna Loeb and Erika Byun

Contents
Introduction
Evidence-Based Educational Policymaking and Outcomes-Based Accountability
A Brief Overview of the History and Purpose of ILSAs
Administrative Accountability and the Role of ILSAs
A Framework for Considering the Potential for ILSAs in Accountability Systems
Research on Administrative Accountability and the Role of ILSAs
Conclusion
References

Abstract

As nations across the world have experienced globalization and an increasingly competitive economy, calls for governments to monitor public sector functions, including education systems, have grown. As a result, educational evidence-based policymaking and outcomes-based accountability have emerged in recent decades as global phenomena. Evidence-based policymaking is the use of information, such as program effectiveness or differential achievement of different groups of students, to make policy decisions. Outcomes-based accountability is a specific type of evidence-based policymaking that uses data on the effectiveness of each unit in a given system – such as each school in a school system or each teacher in a school – to make decisions about that particular unit. Though accountability systems have been a controversial approach for providing consumers – policymakers, parents, and the public – with information needed to determine how well public education systems are educating their students, they have a long-standing presence in many nations. In the United States, outcomes-based accountability has played a core role in federal education policies following the introduction of state accountability systems in the 1990s, and a host of other countries have had centralized reporting of national standardized exams for several decades (Mizala et al., J Dev Econ 84(1):61–75, 2007).

Keywords

Education · Schools · Accountability · Policy · International

S. Loeb (*)
Harvard Kennedy School, Cambridge, MA, USA
e-mail: [email protected]

E. Byun
The Wharton School, Philadelphia, PA, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_6

Introduction

As nations across the world have experienced globalization and an increasingly competitive economy, calls for governments to monitor public sector functions, including education systems, have grown. As a result, educational evidence-based policymaking and outcomes-based accountability have emerged in recent decades as global phenomena. Evidence-based policymaking is the use of information, such as program effectiveness or differential achievement of different groups of students, to make policy decisions. Outcomes-based accountability is a specific type of evidence-based policymaking that uses data on the effectiveness of each unit in a given system – such as each school in a school system or each teacher in a school – to make decisions about that particular unit. Though accountability systems have been a controversial approach for providing consumers – policymakers, parents, and the public – with information needed to determine how well public education systems are educating their students, they have a long-standing presence in many nations. In the United States, outcomes-based accountability has played a core role in federal education policies following the introduction of state accountability systems in the 1990s, and a host of other countries have had centralized reporting of national standardized exams for several decades (Mizala et al., 2007).

While countries commonly use intranational tests for accountability, international large-scale assessments (ILSAs) have emerged as another source of information for decision-making. Only a small number of nations and economies participated in ILSAs when they first emerged in 1964. Today, a broad swathe of countries across the world partakes in ILSAs, and a variety of stakeholders – policymakers, researchers, practitioners, the community, and parents – use their results. The question remains whether ILSAs, as they are currently designed and used, can help improve education decision-making and, ultimately, student learning.

In what follows, we first set the context by describing the emerging interest in evidence-based policymaking and outcomes-based accountability. We then examine the growth in the use of ILSAs across the globe and show how ILSAs have taken on a role in evidence-based policymaking. Finally, we lay out a framework to help determine the usefulness of ILSAs for accountability purposes and offer examples of research using ILSAs to assess whether information from ILSAs can lead to educational improvement.

No measure is perfect – thus it is up to users to understand whether the measures are valid, reliable, and comprehensive and, just as importantly, how they compare to existing alternatives for the purposes in question. Currently, ILSAs are better suited – and in practice actually used – to assess the effects of programs and policies than to serve as the underlying measures for outcomes-based accountability systems. Assessments that measure outcomes for all students, can be linked over time to create measures of student learning (change over time), and are more closely tied to jurisdiction-specific standards are likely better suited than ILSAs as measures for outcomes-based accountability systems.

Evidence-Based Educational Policymaking and Outcomes-Based Accountability

Evidence-based educational policymaking has become a popular practice worldwide, despite mixed views on whether the use of “evidence-based” research is beneficial in determining “best practices” (Wiseman, 2010; Nutley et al., 2007; Pawson, 2006; Sanderson, 2006). The Evidence-Based Policymaking Collaborative defines “evidence-based” as “prioritizing rigorous research findings, data, analytics, and evaluation of new innovations” (Evidence-Based Policymaking Collaborative, 2016). Wiseman (2010) contends that policymakers use evidence to inform their decisions for three possible reasons: to find the most effective solutions to educational issues; to move their social, economic, and political agendas forward; and to align themselves with the institutionalized legitimacy of evidence-based educational policymaking. He argues that the legitimized reasons for evidence-based educational policymaking include ensuring quality (i.e., determining whether students are learning what they should be learning in schools), equality (i.e., examining who has high academic performance and who does not), and control (i.e., using evidence to control various school inputs such as funding and curricula).

Theories explaining the growing popularity of evidence-based educational policymaking tie the phenomenon to increased calls to measure the outputs of the public sector. These calls arose in response to economic trends such as globalization and stagflation that placed pressure on governments to document how their public sector, including education systems, used their resources. For example, in 1983 the US Nation at Risk report revealed how other nations were “matching and surpassing [America’s] educational attainments” and called for a complete overhaul of US schools (A Nation at Risk, 1983). America was not alone, however, in experiencing the pressure to publicly manage and monitor its education system in the context of an increasingly competitive and global economy. The shift toward public management of education systems spread across the globe, reflected in the emergence of concepts such as the Global Education Reform Movement (GERM) (Sahlberg, 2016) and New Public Management (NPM) (Gunter et al., 2016).

Accordingly, outcomes-based accountability rose to the forefront of conversations on how jurisdictions can provide oversight over education systems. Accountability, in the general sense, refers to the frameworks, approaches, policies, and actions that help hold those with responsibilities to high standards of performance. Accountability can take various forms. For example, school boards, school choice, and regulations are, respectively, forms of political, market-based, and administrative accountability. When elected officials, particularly in a democracy, decide school policies based on their electorates’ preferences, they are reflecting the political form of accountability. In market-based accountability, families use schools’ performance and quality to select schools for their children. Market-based accountability encourages school choice options, such as charter schools, magnet schools, vouchers, and tuition tax credits. Political and market-based accountability approaches are often not sufficient when educational consumers have unequal political, social, and economic resources as well as unequal access to available information. Central governments can supplement these forms of accountability with outcomes-based accountability, as well as with direct regulations. Outcomes-based accountability in education relies on performance measures such as student standardized test scores or direct observations of school practices and processes, often collected by inspectorates or student, parent, and teacher surveys.

With the growing emphasis on showcasing national and international educational achievements across the world (Wiseman & Baker, 2005; Hall, 2005; Wiseman, 2010), accountability has in recent decades centered most on student performance measures. This form of accountability became a critical component of public schools starting in the 1980s, especially in the United States and United Kingdom, alongside efforts to measure public and nonprofit sector performance more generally (Figlio & Kenny, 2009). It emerged from the concern that educators may be incentivized to act against the interests of education stakeholders – parents, the community, or policymakers – as stakeholders have difficulty overseeing school practices (Figlio & Loeb, 2011). School accountability systems aimed to address this problem by providing stakeholders with information on how well their schools, teachers, and districts were performing. Stakeholders then respond to this information – for example, policymakers can change education policies and parents can choose their child’s school based on the information. Accountability can occur at various levels (e.g., students, schools, districts, teachers), assign responsibility to various actors (e.g., school, district, nation), specify who is entitled to an account (e.g., parents, constituents), and determine what needs to be accounted for (e.g., student or school performance).

According to Figlio and Loeb (2011), the United States, the United Kingdom, and Chile have had the most sophisticated outcomes-based accountability systems. In the United States, for example, outcomes-based accountability using student performance measures has had deep roots in the public school system since the 1980s, with the emergence of the New Right movement and neoliberal reforms (Smith, 2014; Dorn, 2007). Texas and North Carolina were two of the first states to implement test-based accountability. The No Child Left Behind Act (NCLB) of 2002 brought test-based accountability policies to the federal level; it required states to assess students in reading and mathematics from grades 3 through 8 and also mandated that states evaluate their schools based on whether their students were progressing toward 100% proficiency by 2014 (Figlio & Loeb, 2011). Since the passage of the Every Student Succeeds Act (ESSA) in 2015, responsibility for school and student performance, along with greater autonomy over standards and accountability, has moved somewhat back toward the states, but the federal role is still larger than prior to NCLB (Loeb & Byun, 2019).

Other countries have started to implement centralized intranational exams for school accountability purposes in the past few decades, though they vary in levels of assessment and reporting qualities. In a study by Woessmann (2001), 15 of 39 countries had some form of centralized exam, meaning exams administered to all students by a body beyond the school level. Since the 1980s, almost all European countries have put national testing policies into place (Eurydice, 2009; Smith, 2014). East Asian nations have also started introducing standardized assessments; for instance, Hong Kong has a criterion-referenced assessment at the school level (Mok, 2007). In Latin America, Brazil, Chile, Colombia, and Mexico have developed substantial assessment capacity (Ferrer, 2006). Exam results are used for teacher, as well as school, accountability, though not to as great an extent. Approximately 50% of students, on average across Organization for Economic Cooperation and Development (OECD) nations, are in schools in which principals reported that student assessments are used to make judgments about teachers’ performance (OECD, 2013a). While the United Kingdom, Sweden, and Turkey use teacher performance as a criterion to decide teachers’ base salary, Denmark, Mexico, Norway, and Finland use teacher performance to decide on supplemental payments (OECD, 2012).

A Brief Overview of the History and Purpose of ILSAs

Alongside intranational exams, international large-scale assessments (ILSAs) have taken on a more significant role in outcomes-based accountability for education, in light of demands for jurisdictions across the world to measure and quantify their public sector outputs, combined with the increasingly competitive and global economy. ILSAs are international assessments that aim to collect comparable evidence of performance on a range of domains such as reading, science, mathematics, and English as a foreign language. Examples of ILSAs targeting students include the Programme for International Student Assessment (PISA), the Trends in International Mathematics and Science Study (TIMSS), and the Progress in International Reading Literacy Study (PIRLS). Regional ILSAs include the Latin American Regional Comparative and Explanatory Study (ERCE) and the Southern African Consortium for Monitoring Educational Quality (SACMEQ). ILSAs often report scores across countries by content and competency (Lockheed & Wagemaker, 2013).

Substantial bodies of literature outline how jurisdictions have been using ILSAs. When ILSAs first emerged, nations primarily relied on them as a monitoring and “thermometer” tool (Robitaille et al., 1989; Postlethwaite & Ross, 1992; Lockheed & Wagemaker, 2013). Now, the purposes of ILSAs vary widely. Singer, Braun, and Chudowsky (2018) list seven common uses of ILSAs: a tool for transparency, a comparison tool for student achievement within and across nations, a tracking tool for changes in student achievement, a catalyst tool for instigating educational policy reforms, a de facto international benchmarking tool for ranking top-performing countries and economies, an evaluation tool for the effectiveness of instructional practices and policies, and a causal analysis tool for determining the relationship between student achievement and social and economic factors. In general, the consensus among scholars is that ILSAs can contribute to policy transparency and to instigating education reforms but may not be well designed for benchmarking through ranking countries’ performances, evaluating the effectiveness of practices and policies, or determining causal relationships (Singer et al., 2018).

When ILSAs first emerged in the 1960s, only a select group of jurisdictions used them (Postlethwaite, 1967). (As Singer and Braun (2019) note, “jurisdiction” may be a more accurate term than “country” because some ILSAs allow states, provinces, and cities to participate.) The first International Association for the Evaluation of Educational Achievement (IEA) Pilot Study in 1964 consisted of 12 high- and upper-middle-income countries: Belgium, England, Germany, Finland, France, Israel, Poland, Scotland, Sweden, Switzerland, the United States, and Yugoslavia. By the 1970s, lower-income countries such as Chile, Hungary, and India also began to participate. The past few decades have seen exponential growth in the number of jurisdictions actively utilizing ILSAs, particularly among lower- and middle-income economies (Bloem, 2015; Addey et al., 2017). The OECD reported that 93 jurisdictions have participated since PISA started and that 79 jurisdictions participated in PISA 2018 (https://www.oecd.org/pisa/aboutpisa/pisa-participants.htm). Many scholars hypothesize that ILSAs grew in part due to the shift in their primary users from researchers to governments. For example, the affiliated members of the IEA, which first conducted ILSAs, changed from majority researchers to governmental agency representatives, while the OECD also involved itself in the administration of PISA (Pizmony-Levy & Bjorklund, 2014).

ILSAs have also come into the limelight due to media coverage across the globe. Increasingly, nations and economies rely on published “league tables” of ILSA results that rank average student performances across participating jurisdictions (Singer & Braun, 2018; Feuer et al., 2015). Sensationalized coverage of countries’ performance and ranking often follows the release of ILSA results, typically without providing the context of jurisdiction-specific factors that might come into play. For example, Germany experienced what is now referred to as a “PISA shock” when PISA’s 2000 rankings were published and the “subpar” student performance in Germany misaligned with the country’s own perception of its education system (Baroutsis & Lingard, 2018). After the release of the 2018 PISA and NAEP results in December 2019, a multitude of news articles emerged comparing the United States’ flat performance to top-performing jurisdictions such as China, Singapore, Canada, and Finland (Goldstein, 2019; Jacobson).
Other articles try to explain the decline in performance of students in various jurisdictions and what they should do about it (Horn, 2020).

Existing works have theorized the causes of the increase in the use of ILSAs. Some scholars frame the popularity of ILSAs as a rational decision made by jurisdictions, based on the advantages and disadvantages of administering ILSAs (Addey et al., 2017). For example, policymakers following the rationalist account might conclude that the advantages of participating in ILSAs outweigh the disadvantages, given their desire for improved education systems, their need for measures of human capital, and the globalization of educational assessment (Braun & Singer, 2019; Kellaghan, 2001; Meyer & Benavot, 2013; Kamens & McNeely, 2010). Other scholars contend that ILSAs’ growth has less to do with rationality and more to do with jurisdictions pursuing legitimacy for their educational systems in the eyes of external stakeholders, regardless of whether there are advantages to relying on ILSAs. According to Verger (2017), countries may use ILSAs not because they accurately capture the overall performance of their students but because countries perceive pressure from peer institutions to legitimize their education systems. Moreover, ILSAs can help validate countries’ political, economic, and social status across the world. With mainstream media often focusing primarily on reporting league tables of ILSA results (Froese-Germain, 2010), it is not surprising that performance on these tests has become a signal of the legitimacy of education systems around the world.

Scholars have also explained the rise of ILSAs in the context of the political economy (Dale, 2005; Steiner-Khamsi, 2010; Rizvi, 2009; Robertson & Dale, 2015; Verger et al., 2016; Addey et al., 2017). Experts argue that economies have become increasingly knowledge-based in a technology-driven world, such that human capital in the form of intellectual capacities has gained value for countries in recent decades (Powell & Snellman, 2004). Accordingly, jurisdictions have started to use ILSA results as a measure to compare the competitiveness of economies and to identify how well education systems prepare students for thriving in these economies. In fact, Addey and Sellar (2017) contend that PISA is specifically designed to assess a jurisdiction’s human capital at the completion of compulsory education so that countries can determine policies for producing the most competitive workers. Given the pressure that nations feel to establish their global economic, political, and social status, as well as to ensure that their education systems are training and producing competitive workers, jurisdictions have started to rely more on ILSAs to provide them with outcome measures with which to judge the performance of their school systems.

Administrative Accountability and the Role of ILSAs

In recent decades, countries have been relying not only on intranational assessments but also on international large-scale assessments (ILSAs) to better understand their system’s performance and, in particular, the extent to which education decisions are evidence-based. The PISA questionnaire showcases this trend through the increase in the number of questionnaire items related to assessment and accountability (Teltemann & Jude, 2019). For example, the 2006 PISA questionnaire asked whether schools posted achievement data publicly or tracked data over time using an administrative authority. In comparison, the 2015 PISA not only asked about these two accountability procedures but also about whether schools used the data to evaluate the principal’s and teachers’ performance, made decisions about instructional resource allocation at the school based on achievement data, or provided achievement data directly to parents. A 2013 OECD report that compared these two accountability practices – tracking achievement data over time and posting assessment data publicly – found that tracking over time was much more common, particularly in the United States, the Netherlands, the United Kingdom, Sweden, and New Zealand (OECD, 2013b).

Measures from ILSAs can be used for accountability in a variety of ways. If publicized, ILSAs can provide information that can help mobilize the public around addressing needs in schools. Policymakers can also use them punitively to set explicit rewards and/or sanctions based on performance on the ILSAs. For example, central authorities have used PISA scores to reward schools with above-average performance on the test, and consequently their principals and teachers (OECD, 2013b). ILSA measures can also instigate sanctions, including but not limited to less autonomy, more reporting requirements, and more alternative school options through school choice or additional services. For example, declining performance in Ireland on the 2006 and 2009 PISAs led to changes in curriculum, teacher education, and the inspectorate process, even though an independent review showed that issues with the international calibration of assessment items and student disengagement from PISA may have exaggerated the decline (Braun & Singer, 2019). However, while ILSAs have in some cases been used as outcome measures for accountability, these tests have drawbacks for this purpose.

A Framework for Considering the Potential for ILSAs in Accountability Systems

ILSAs are not unlike intranational assessments in that their usefulness for administrative accountability depends on their design. Tests that are well designed are better able to fulfill their intended purpose – whether that purpose is to provide information to consumers, such as parents or homeowners in a market-based model and policymakers and voters in a political model, or to hold schools directly accountable for their students’ performance. If the test is not designed well for its purpose, it may not measure what it is intended to measure (i.e., validity), it may not measure its underlying constructs precisely (i.e., reliability), or it may not capture all domains of interest (i.e., comprehensiveness). The alignment between purpose and design is important. For example, even if a measure accurately captures student learning, it may not shed light on schools’ contribution to this learning.

Many of ILSAs’ validity, reliability, and comprehensiveness considerations overlap with those of intranational tests. However, ILSAs pose an added complexity because they need to be well designed for, and applicable across, a variety of jurisdictions with different educational systems. The different educational systems, combined with varying economic, social, and political contexts, may yield additional concerns with the use of ILSAs for test-based accountability. Moreover, many intranational assessments include all students in all schools and are able to follow students over time, allowing for the creation of measures of individual student learning. ILSAs, in contrast, are generally cross-sectional samples of students within samples of schools. The less inclusive sampling limits their validity for measuring growth or learning. In this section, we dive deeper into the validity, reliability, and comprehensiveness considerations for all tests and, in particular, ILSAs.

Validity refers to whether the test actually measures what it is designed to measure. In other words, is the test valid for the question at hand? Across all types of tests – from intranational tests like state assessments and the National Assessment of Educational Progress (NAEP) in the United States to international tests like the PISA – validity often denotes whether the test is measuring how individual students are performing. For tests used specifically for accountability purposes, however, the goal and construct of the tests’ measures are particularly important considerations. One goal might be to measure how much science students learn, while another might be to measure how much teachers or schools contribute to students’ learning of science, controlling for factors outside the school’s control. At the same time, different consumers might be using the information: parents in market models who are choosing schools for their children, school administrators and boards who want to determine the effectiveness of their schools in building the capacities of their students, and policymakers in political models who aim to improve the nation’s education system. School administrators might use the test measures to determine how the students in their particular schools are performing. Education policymakers’ goal might be to understand whether their schools are as effective as other schools. If the goal and construct of the test do not align, the test will not yield valid inferences for accountability from the measures.

While all of these validity aspects hold for both intranational tests and ILSAs used for accountability purposes, large-scale international tests require additional validity checks. A body of research addresses this issue, and many studies report tests of the cross-cultural comparability of assessments including PISA, ERCE, and TALIS (see, e.g., Eryilmaz et al., 2020; Isac et al., 2019; Sandoval-Hernandez et al., 2019; Rutkowski & Rutkowski, 2018; Rutkowski & Svetina, 2017; Desa, 2014). Cross-cultural differences, not just by country but also by region, can affect how the test questions are read and interpreted by students. Translating ILSA instruments into various languages may also affect the validity of the measures. The sample of students taking the ILSA also may not be comparable across jurisdictions and may not represent the overall performance of the country. For example, age ranges for grades may vary across educational systems. ILSAs often allow participating jurisdictions to exclude up to 5% of their student population for various reasons, spanning from disability to language proficiency issues. According to Carnoy (2015), many ILSAs have “focus” years for particular subjects; for example, PISA’s “focus” years were 2000 and 2009, in which the majority of students took the test.
In 2003, 2006, and 2012, however, only 40% of the student sample actually took the test, which in itself was not identical for all students (Carnoy, 2015). If such differences in samples exist, the ILSA measures may produce inferences that are not valid.
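The sampling and exclusion issues described above can be made concrete with a small, purely illustrative calculation. The sketch below computes a design-weighted national mean from a handful of invented sampled students and shows how excluding one low-performing group shifts the estimate; the scores, weights, and exclusion share are assumptions for illustration, not figures from any actual ILSA.

```python
# Illustrative only: invented scores and sampling weights, not real ILSA data.
# Each sampled student "represents" many students in the population; excluding a
# small, low-performing group shifts the weighted national estimate.
scores  = [380, 420, 455, 470, 500, 520, 560, 610]   # sampled students' scores
weights = [300, 850, 800, 800, 750, 700, 650, 550]   # students each record represents

def weighted_mean(scores, weights):
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

full = weighted_mean(scores, weights)
# Drop the lowest-scoring record (e.g., an excluded subpopulation of roughly 5.6%
# here: 300 out of 5,400 represented students).
excl = weighted_mean(scores[1:], weights[1:])

print(f"Full-population estimate:  {full:.1f}")
print(f"Estimate after exclusion:  {excl:.1f}")
```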


Social, political, and economic circumstances may affect the validity of ILSA measures across all jurisdictions. For example, some jurisdictions may prioritize student performance or have stronger political support for exams compared to other jurisdictions. As a result, students in these jurisdictions may have spent more time in class on test preparation than those in other jurisdictions, to the detriment of other, non-cognitive skills. Particularly for ILSAs that are presented in country rankings, treating them as valid measures of what they are designed to measure and making reform decisions based solely on the rankings is a significant concern with the use of these tests for accountability (Torney-Purta & Amadeo, 2013). Relatedly, the number of jurisdictions participating in the ILSA can influence the validity of the test results, especially if stakeholders are comparing results across countries.

Unlike intranational tests used for accountability, such as US state standardized tests, ILSAs are typically not designed for longitudinal studies of schools and students. Most ILSAs are not administered every year and often do not test the same sample of students. Thus, these assessments may not track long-term individual student progress or changes over time well (Braun & Singer, 2019). Traditionally, tests have measured how well schools do based on levels of student performance – referred to as the “status” approach – or based on the degree to which students improve from one year’s test performance to the next – referred to as the “growth” approach (Ladd & Lauen, 2010). The two approaches have different goals and thus measure different outcomes: status-based systems incentivize schools to raise student performance to a certain level, while growth-based systems encourage schools to show relative improvement in student performance, independent of the absolute level of that achievement. Status-based systems are more common in evidence-based policymaking, at least in part because they require less data (only one year of assessment scores) and because they appear more transparent. Growth measures tend to adjust for prior scores and student characteristics because of differences in average growth across groups; as a result, the average score for students in a school is easier for most people to understand than an adjusted measure of average growth. Growth-based systems, on the other hand, are more equitable in that they take into consideration external factors like family background that play a significant role in student achievement, but they are also less transparent. A school that ranks low based on status may rank high based on growth, and vice versa. An important component of capturing growth in learning, however, is longitudinal data (Lockheed & Wagemaker, 2013). While tests like ILSAs, which do not test the same sample of students each year, may certainly help inform status, they may not be valid as a growth measure for individual students, both within and across jurisdictions.
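A minimal sketch, using invented data, of the status/growth distinction just described: status needs only one year of scores, growth needs the same students linked across years (which cross-sectional ILSAs do not provide), and the two can rank schools quite differently. The school names, scores, and simple mean-gain growth measure are illustrative assumptions, not the measures used by any actual accountability system.

```python
# Illustrative only: hypothetical scores for a few schools over two test years.
import statistics

# (school, prior_year_score, current_year_score) for linked students -- invented
records = [
    ("School A", 420, 455), ("School A", 500, 520), ("School A", 460, 490),
    ("School B", 560, 565), ("School B", 610, 600), ("School B", 580, 590),
    ("School C", 380, 430), ("School C", 410, 455), ("School C", 440, 470),
]

schools = sorted({r[0] for r in records})

def status(school):
    """Average current-year score: one year of data, easy to read."""
    return statistics.mean(cur for s, _, cur in records if s == school)

def growth(school):
    """Average gain for the same students across years: needs linked data."""
    return statistics.mean(cur - prior for s, prior, cur in records if s == school)

for s in schools:
    print(f"{s}: status = {status(s):6.1f}, growth = {growth(s):6.1f}")

# Rankings can flip: a school with low status may show high growth, and vice versa.
print("Ranked by status:", sorted(schools, key=status, reverse=True))
print("Ranked by growth:", sorted(schools, key=growth, reverse=True))
```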
Alongside validity are reliability concerns. Reliability refers to whether the test results are consistently precise; that is, the test may or may not reliably capture what it is intended to capture under the same conditions. Measurement error is a critical component of determining reliability. For example, test scores with significant measurement error would not provide helpful information to the consumer about whether schools improved on a particular dimension, regardless of whether they actually improved or not. Poorly designed test instruments or varying test administration settings can create measurement error. The time limitations on taking tests can also impact the reliability of the results (Hargreaves & Braun, 2013). Creating an overall school performance measure or rating from individual student test scores can likewise affect score reliability.

For ILSAs, reliability considerations are greater, though the ILSAs do have rigorous quality control procedures to support the validity and reliability of their data (see, e.g., the technical manuals of PISA, TIMSS, PIRLS, TALIS, and ICCS (the International Civic and Citizenship Education Study)). Given the variation of educational systems across the world, students from two countries with the same content knowledge may answer the same question differently, creating differences between the observed and actual values of the measure. For example, one of the PIRLS background questions asked parents and students how many books they have at home; the answer choices on the student questionnaire were approximate quantities illustrated with graphics of books, while the choices on the parent questionnaire were ranges of numbers. Singer, Braun, and Chudowsky (2018) found that there were not only discrepancies in responses between students and parents but also across countries: while the average correlation between student and parent reports in Kuwait was 0.35, it was 0.76 in Portugal and 0.92 in Georgia. In recent years, ILSAs have been making a shift toward digitally based assessments (DBA), which can potentially make ILSAs less prone to error because administrators can oversee students while they are taking the test and the assessments can be adaptive, tailored to a student’s proficiency level. ILSA experts anticipate improvements in processing, scaling, and reporting as a result of the transition to DBA. Still, these potential benefits depend on building the capacity of students and test administrators to take ILSAs in a DBA format.

Sometimes, results of multiple ILSAs are presented as a combined cross-national indicator, especially since countries tend to rely heavily on league tables of average ILSA scores. A poorly constructed cross-national indicator can obscure age-based, grade-based, and subject-based differences across educational systems and thus affect the reliability of the measure (Lockheed & Wagemaker, 2013). Even if ILSA scores are reliable for a nation as a whole, they may not be as reliable for sub-national entities (e.g., states in the United States). If consumers of the results, such as policymakers, factor in the unique context of the nation or economy when interpreting results, ILSA measures could still provide helpful information on the performance of students. Singer et al. (2018) contend that culture-specific information about all tested countries and their sub-national entities is critical for ensuring the reliability and validity of ILSA results.
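One simple way to see the kind of inconsistency reported above is to correlate two reports of the same construct, such as student and parent answers to the books-at-home question. The sketch below does this with invented numbers; the data, and the choice of a plain Pearson correlation, are illustrative assumptions rather than the PIRLS data or the indices ILSAs actually construct.

```python
# Illustrative only: invented student and parent reports of "books at home",
# standing in for the kind of duplicate reports ILSAs collect.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data for two jurisdictions: in one, student and parent reports
# track each other closely; in the other, they diverge.
student_j1 = [5, 15, 40, 90, 150, 250, 60, 25]
parent_j1  = [10, 20, 45, 80, 160, 230, 55, 30]
student_j2 = [5, 15, 40, 90, 150, 250, 60, 25]
parent_j2  = [60, 10, 150, 30, 40, 90, 200, 15]

print("Jurisdiction 1:", round(pearson(student_j1, parent_j1), 2))  # high agreement
print("Jurisdiction 2:", round(pearson(student_j2, parent_j2), 2))  # low agreement
# Low agreement signals measurement error in at least one report, which weakens
# any indicator (e.g., a home-resources index) built from these responses.
```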

Comprehensiveness is also an important factor in determining whether ILSA scores can be useful in the context of accountability. Do the measures cover progress toward all intended goals? While a single assessment, or even a set of assessments such as the ILSAs, cannot capture all valued outcomes, ILSAs do measure a large range of capabilities for students. TIMSS and PIRLS focus largely on mathematics, science, and reading, but since its beginnings, PISA has included multiple additional outcomes. Furthermore, ILSAs such as ICCS focus on citizenship education as well as attitudes and behaviors toward demographic groups. Realistically, one test cannot capture all goals for students. Thus, the scope and content of a test or battery of tests speak to which goals are being prioritized and signal the values of the nation. In the United States, for example, the most popular intranational tests, such as the NAEP, focus primarily on mathematics and English Language Arts, though they also cover other areas less frequently, suggesting that in the United States, being numerate and literate may be considered the most important goals. However, other educational outcomes matter as well. Brighouse, Ladd, Loeb, and Swift (2018), for example, offer a wide swathe of knowledge, skills, attitudes, and dispositions that influence the flourishing of individuals and of others by providing capacities for economic productivity, personal autonomy, democratic competence, healthy personal relationships, treatment of others as equals, and personal fulfillment.

Capturing all goals of a single jurisdiction or economy in one test is close to impossible. Even within a quite narrow field, such as middle school mathematics, the range of possible skills to measure and the weights given to each of these skills can affect performance, especially across heterogeneous countries (a minimal illustration follows at the end of this section). Finding an international assessment that is comprehensive for a range of jurisdictions is difficult. ILSAs in DBA format can potentially offer a wider range of question formats, which may allow for more comprehensive measures than a multiple-choice format would be able to support, but the fundamental question of domain coverage will remain.

No measure is perfect, regardless of whether the measure derives from an intranational or international test. Wagemaker (2020) provides a detailed analysis of the reliability and validity of ILSAs, pointing out that the number of countries participating in the assessments has grown considerably during the last 60 years and that, as a result, new methodologies and assessment strategies are needed to ensure reliability and validity. However, the current alternative to imperfect measures – not using test results at all – may yield other issues. Without assessment results, stakeholders including parents, school administrators, and the local community may have even poorer information with which to decide which school their child should attend, where to live, and whether a particular school is supporting students in developing the capacities and skills that society values. The framework of considering the validity, reliability, and comprehensiveness of ILSAs provides a foundation for determining whether the imperfections of the ILSAs outweigh the usefulness of the measures for accountability, and vice versa. While ILSAs may be helpful in accountability systems, the question remains whether outcomes-based accountability itself actually improves educational opportunities and student learning.
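Returning to the domain-weighting point above, the sketch below shows, with invented domain scores, how two equally defensible weightings of the same skills can reverse which of two jurisdictions looks stronger. The jurisdiction names, domain scores, and weights are illustrative assumptions, not any actual ILSA framework.

```python
# Illustrative only: how the weights given to different skill domains can flip a
# comparison between two hypothetical jurisdictions with different strengths.
profiles = {
    "Jurisdiction X": {"number": 540, "algebra": 500, "data": 470},
    "Jurisdiction Y": {"number": 480, "algebra": 505, "data": 545},
}

def composite(profile, weights):
    total = sum(weights.values())
    return sum(profile[d] * w for d, w in weights.items()) / total

weighting_a = {"number": 0.5, "algebra": 0.3, "data": 0.2}   # emphasizes number skills
weighting_b = {"number": 0.2, "algebra": 0.3, "data": 0.5}   # emphasizes data skills

for name, weights in [("Weighting A", weighting_a), ("Weighting B", weighting_b)]:
    ranked = sorted(profiles, key=lambda j: composite(profiles[j], weights), reverse=True)
    print(name, "->", ranked)
```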

Research on Administrative Accountability and the Role of ILSAs

No research that we know of assesses the effects of outcomes-based accountability processes that use ILSAs as the measure in the accountability system. However, researchers have used ILSAs, alongside intranational tests, to assess the effects of accountability. Currently, more research covers the effects of outcomes-based accountability systems on student achievement using intranational tests rather than ILSAs, and that research focuses on outcomes-based accountability in the United States. For example, Carnoy and Loeb (2002) assessed the effects of state implementation of accountability practices prior to the No Child Left Behind Act (NCLB) and found that states that implemented outcome-based accountability showed greater gains on NAEP. The tests used for these accountability systems are statewide tests administered to all students in specified grades. Similarly, Dee and Jacob (2009) used NAEP data to compare the performance gains of states with and without school-level accountability systems before the NCLB, which required states to test students in grades 3 through 8 and in high school. The authors found larger math achievement gains for fourth- and eighth-grade students in states that did not have school-level accountability systems before the NCLB. Studies using both NAEP and state assessment data often find that results using the NAEP, which is not tied to accountability in the United States, are typically smaller than those using state assessments directly tied to accountability systems (Jacob, 2007; Jennings & Lauen, 2016). Overall, US scholars have found small to moderate positive impacts of accountability systems in the United States on student achievement in math, as measured by student performance on low-stakes assessments such as NAEP (Jennings & Lauen, 2016; Carnoy & Loeb, 2002; Dee & Jacob, 2009; Hanushek & Raymond, 2004; Jacob, 2005, 2007; Rouse et al., 2007; Lauen & Gaddis, 2016). Though, on average, the research literature finds positive test performance effects, some studies find no significant effects of accountability on student performance (Lee & Reeves, 2012; Lee & Wong, 2004; Smith & Mickelson, 2016).

ILSAs have provided evidence for evidence-based decision-making even if they are not commonly used as the outcome measure for test-based accountability. For example, researchers have used ILSAs to assess the effects of external exit exams. These studies have found that students in jurisdictions with external exit exams perform better than those in jurisdictions without an exit exam system (Fuchs & Wößmann, 2007; Wößmann, 2003, 2005; Bishop, 1997, 2006). For example, Schutz, West, and Wößmann (2007) found evidence that students in jurisdictions with external exit exams had higher math scores on the PISA compared to students in jurisdictions without external exit exams. The same study found that accountability practices aimed at teachers (e.g., monitoring teacher lessons) increased students’ scores but also reduced equity in those jurisdictions, measured by the disparity between low and high socioeconomic status students. Similarly, Camminatiello, Paletta, and Speziale (2006) found that school-based accountability, measured by responses on the 2006 PISA questionnaire, positively impacted the achievement of students in 57 participating jurisdictions as measured by the third cycle of PISA, and that the combination of accountability and school autonomy over teacher salaries produced a stronger positive effect.
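Analyses of this kind typically pool student records across jurisdictions and regress achievement on a system-level policy indicator (such as the presence of external exit exams) plus student background controls. The sketch below illustrates only the general shape of such a regression with simulated data; the variables, effect sizes, and plain OLS specification are assumptions for illustration, and real studies additionally work with plausible values, sampling weights, and standard errors clustered by jurisdiction.

```python
# Illustrative only: simulated data standing in for pooled ILSA records.
# Real analyses use plausible values, sampling weights, and clustered errors.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_countries = 2000, 20

country = rng.integers(0, n_countries, n_students)   # jurisdiction of each student
exit_exam = (country % 2 == 0).astype(float)         # system-level policy indicator
ses = rng.normal(0.0, 1.0, n_students)               # student socioeconomic index

# Hypothetical data-generating process: a 15-point exam "effect" and an SES gradient.
score = 480 + 15 * exit_exam + 35 * ses + rng.normal(0, 80, n_students)

# OLS of score on [intercept, exit_exam, ses] via least squares.
X = np.column_stack([np.ones(n_students), exit_exam, ses])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

print(f"Estimated exit-exam coefficient: {beta[1]:.1f} score points")
print(f"Estimated SES gradient:          {beta[2]:.1f} points per SD of SES")
```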

Wößmann (2005) found more nuanced results. Systems with accountability mechanisms through central exams had greater student performance on the TIMSS whether or not schools had autonomy over teacher salaries, while systems with both school autonomy and central exams showed higher math performance on the TIMSS than their counterparts without school autonomy. However, systems with school autonomy over teacher salaries but without central exams fared worse with respect to math performance on the TIMSS compared to school systems without either school autonomy or central exams. Woessmann (2007) contends that these results indicate that the presence of accountability systems creates positive incentives for schools to use their autonomy responsibly to promote student performance. In contrast, Gandara and Randall (2015), studying Australia, Korea, Portugal, and the United States, found that school-level accountability practices had a small negative effect on science achievement on the PISA 2006. A paper by Yi (2015) examined the school accountability system and autonomy practices in Korea, which adopted the National Assessment of Educational Achievement (NAEA) in 2008, published school performance reports starting in 2010, and initiated a program to provide financial incentives to low-performing schools based on NAEA results; the study did not find effects of school accountability on student math achievement on the PISA 2003 or 2012. Similarly, using PISA, Kameshwara, Sandoval-Hernandez, Shields, and Dhanda (2020) found no effects of decentralization of school decision-making on student outcomes.

As the mixed results of these studies indicate, test-based accountability may not always improve school performance and educational systems. One explanation for the mixed or negative results comes from unintended responses from system stakeholders. Teaching to the test has been a widely cited issue for intranational tests (Rosenkvist, 2010; Gandara & Randall, 2015). As an extreme example, some Chicago teachers fraudulently completed student exams in response to accountability pressures (Jacob & Levitt, 2003). Other studies have documented the narrowing of the curriculum, an unintended consequence of the fact that most assessments cannot be truly comprehensive (Judson, 2012; Koretz, 2009; Gandara & Randall, 2015). Schools can also manipulate factors to yield more favorable results on standardized assessments, for example by changing discipline and suspension practices (Figlio, 2006) or meal programs (Figlio & Winicki, 2005) at the time tests are administered, and by strategically placing teachers in certain grades and schools (Boyd et al., 2008). Moreover, stakeholders may not be able to respond to incentives from accountability systems. For example, resources necessary to make improvements may not be available, limiting the extent to which stakeholders can act upon the information from accountability systems. The leadership similarly may not have the capacity to enact changes to education policies and practices. Especially when policymakers are trying to implement educational reforms, other salient priorities and the political bureaucracy of the country or economy may impede prompt responsiveness to accountability. In these cases, accountability systems would not lead to improved school performance and educational systems.

These studies are samples of those that have assessed the effects of test-based accountability policies. The United States has the NAEP as well as state assessments that provide large-scale data for these studies. Outside of the United States, ILSAs have been useful for assessing the effects of accountability, as well as the effects of a wide range of other education policies, from peer effects (Rangvid, 2003) to computers (Bielefeldt, 2005) to inquiry teaching (Jiang & McComas, 2015).

Overall, this research points to the potential usefulness of ILSAs for evidence-based decision-making, but not necessarily for use as a measure of school or educator effectiveness, given the challenges to their reliability, validity, and comprehensiveness for this purpose.

Conclusion

International large-scale assessments (ILSAs) emerged within the context of growing global interest in evidence-based policymaking and outcomes-based accountability systems, and they have played a role in these processes. Policymakers have used ILSAs to rally interest in educational improvement and to benchmark their jurisdiction against international norms. Researchers have used ILSA results to assess the effects of a range of educational policies and practices, including those tied to accountability approaches. However, ILSAs are not well suited as measures of school improvement for outcomes-based accountability itself.

No assessments are perfect. They vary in their validity for each potential purpose, in their reliability, and in their comprehensiveness. ILSAs face particular challenges because of the heterogeneity of the population of students and of the educational contexts in which they are applied. These challenges may make the assessments less valid than many local assessments for measuring student learning toward local goals. However, the main shortcoming of ILSAs for use in accountability systems is that they do not cover the full population of students and they do not link students year to year to produce reliable measures of learning gains. As a result, while they may be valid measures of student achievement at a given time, they are less valid tools for measuring student learning over time or schools’ contribution to that learning.

Despite these limitations, ILSAs are useful, at least in part because of the paucity of alternative measures. ILSAs provide a broad swathe of the public with information about their schools and their students’ learning. Moreover, researchers have used ILSAs to convincingly assess the effects of policies and practices on student learning, producing information that can then be used for evidence-based decision-making.

References

A Nation at Risk: The Imperative for Educational Reform. (1983). The National Commission on Excellence in Education. https://www.edreform.com/wp-content/uploads/2013/02/A_Nation_At_Risk_1983.pdf Addey, C., & Sellar, S. (2017). A framework for analysing the multiple rationales for participating in international large-scale assessments. Compare: A Journal of Comparative and International Education. https://doi.org/10.1080/03057925.2017.1301399 Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017). The rise of international large-scale assessments and rationales for participation. Compare: A Journal of Comparative and International Education, 47, 1–19. https://doi.org/10.1080/03057925.2017.1301399


Baker, D., & LeTendre, G. (2005). National differences, global similarities: World culture and the future of schooling. Stanford University Press. https://www.sup.org/books/title/?id=7192 Baker, D., & Wiseman, A. (2005). Global trends in educational policy. http://Lst-Iiep.Iiep-Unesco.Org/Cgi-Bin/Wwwi32.Exe/[In=epidoc1.in]/?T2000=025023/(100), 6. Baroutsis, A., & Lingard, B. (2018, February 18). PISA-shock: How we are sold the idea our PISA rankings are shocking and the damage it is doing to schooling in Australia. EduResearch Matters. https://www.aare.edu.au/blog/?p=2714 Bielefeldt, T. (2005). Computers and student learnings. Journal of Research on Technology in Education, 37(4), 339–347. https://doi.org/10.1080/15391523.2005.10782441 Bishop, J. (1997). The effect of national standards and curriculum-based examinations on achievement. American Economic Review, 87(2), 260–264. Bishop, J. (2006). Drinking from the fountain of knowledge: Student incentive to study and learn – Externalities, information problems and peer pressure, ch. 15. In E. Hanushek & F. Welch (Eds.) (pp. 909–944). Elsevier. https://EconPapers.repec.org/RePEc:eee:educhp:2-15 Bloem, S. (2015). PISA for low- and middle-income countries. Compare: A Journal of Comparative and International Education, 45(3), 481–486. https://doi.org/10.1080/03057925.2015.1027513 Boyd, D., Lankford, H., Loeb, S., Rockoff, J., & Wyckoff, J. (2008). The narrowing gap in New York City teacher qualifications and its implications for student achievement in high-poverty schools. Journal of Policy Analysis and Management, 27(4), 793–818. https://doi.org/10.1002/pam.20377 Braun, H. I., & Singer, J. D. (2019). Assessment for monitoring of education systems: International comparisons. The Annals of the American Academy of Political and Social Science, 683(1), 75–92. https://doi.org/10.1177/0002716219843804 Brighouse, H., Ladd, H. F., Loeb, S., & Swift, A. (2018). Educational goods: Values, evidence, and decision-making. University of Chicago Press. https://www.press.uchicago.edu/ucp/books/book/chicago/E/bo27256234.html Camminatiello, I., Paletta, A., & Speziale, M. T. (2006). The effects of school-based management and standards-based accountability on student achievement: Evidence from PISA 2006. Electronic Journal of Applied Statistical Analysis, 5(3), 6. Carnoy, M. (1999). Globalization and educational reform: What planners need to know (Fundamentals of education planning). United Nations Educational, Scientific, and Cultural Organization. http://unesco.amu.edu.pl/pdf/Carnoy.pdf Carnoy, M. (2015). International test score comparisons and educational policy: A review of the critiques. https://nepc.colorado.edu/publication/international-test-scores Carnoy, M., & Loeb, S. (2002). Does external accountability affect student outcomes? A cross-state analysis. Educational Evaluation and Policy Analysis, 24(4), 305–331. https://doi.org/10.3102/01623737024004305 Carnoy, M., & Rhoten, D. (2002). What does globalization mean for educational change? A comparative approach. Comparative Education Review, 46(1), 1–9. https://doi.org/10.1086/324053 Carnoy, M., Garcia, E., & Khavenson, T. (2015). Bringing it back home: Why state comparisons are more useful than international comparisons for improving U.S. education policy (No. 410). Economic Policy Institute. https://www.epi.org/publication/bringing-it-back-home-why-statecomparisons-are-more-useful-than-international-comparisons-for-improving-u-s-educationpolicy/ Dale, R. (2005).
Globalisation, knowledge economy and comparative education. Comparative Education, 41(2), 117–149. Dee, T., & Jacob, B. (2009). The impact of no child left behind on student achievement. Working paper no. 15531, National Bureau of Economic Research. https://doi.org/10.3386/w15531. Desa, D. (2014). Evaluating measurement invariance of TALIS 2013 complex scales: Comparison between continuous and categorical multiple-group confirmatory factor analyses. OECD Publishing.


Dorn, S. (2007). Accountability Frankenstein: Understanding and taming the monster. Information Age Pub. Ertl, H. (2006). Educational standards and the changing discourse on education: The reception and consequences of the PISA study in Germany. Oxford Review of Education, 32(5), 619–634. JSTOR. Eryilmaz, N., Rivera-Gutiérrez, M., & Sandoval-Hernández, A. (2020). Should different countries participating in PISA interpret socioeconomic background in the same way? A measurement invariance approach. Revista Iberoamericana de Educación, 84(1), 109–133. Eurydice, E. (2009). Early childhood education and care in Europe: Tackling social and cultural inequalities. Brussels: Eurydice. Ferrer, J. G. (2006). Educational assessment systems in Latin America: Current practice and future challenges. PREAL. Feuer, M., Braun, H., Kober, N., & Berman, A. (2015). An agenda for understanding the impact of college rankings on various users and better meeting users’ information needs. Graduate School of Education and Human Development, George Washington University. Figlio, D. N. (2006). Testing, crime and punishment. Journal of Public Economics, 90(4), 837–851. https://doi.org/10.1016/j.jpubeco.2005.01.003 Figlio, D., & Kenny, L. (2009). Public sector performance measurement and stakeholder support. Journal of Public Economics, 93(9–10), 1069–1077. Figlio, D., & Loeb, S. (2011). School accountability. In Handbooks in economics (Vol. 3). Elsevier. https://doi.org/10.1016/S0169-7218(11)03008-5 Figlio, D. N., & Winicki, J. (2005). Food for thought: The effects of school accountability plans on school nutrition. Journal of Public Economics, 89(2–3), 381–394. Froese-Germain, B. (2010). The OECD, PISA and the impacts on educational policy (p. 35). Canadian Teachers’ Federation. https://files.eric.ed.gov/fulltext/ED532562.pdf Fuchs, T., & Wößmann, L. (2007). What accounts for international differences in student performance? A re-examination using PISA data. Empirical Economics, 32(2), 433–464. https://doi. org/10.1007/s00181-006-0087-0 Gandara, F., & Randall, J. (2015). Investigating the relationship between school-level accountability practices and science achievement. Education Policy Analysis Archives, 23, 112. https://doi. org/10.14507/epaa.v23.2013 Goldstein, D. (2019, December 3). ‘It just isn’t working’: PISA test scores cast doubt on U.S. education efforts. The New York Times. https://www.nytimes.com/2019/12/03/us/usstudents-international-test-scores.html Gunter, H. M., Grimaldi, E., Hall, D., & Serpieri, R. (Eds.). (2016). New public management and the reform of education (1st ed.). Routledge. Hall, K. (2005). Science, globalization, and educational governance: The political rationalities of the new managerialism. 12 Indiana Journal of Global Legal Studies, 12(1), 153. https://www. repository.law.indiana.edu/ijgls/vol12/iss1/5 Hanushek, E. A., & Raymond, M. E. (2004). Does school accountability Lead to improved student performance?. Working paper no. 10591, National Bureau of Economic Research. https://doi. org/10.3386/w10591. Hargreaves, A., & Braun, H. (2013). Data-driven improvement and accountability (p. 47). National Education Policy Center. https://www.education.nh.gov/essa/documents/data-drivenimprovement.pdf Horn, M. (2020). What may lurk behind Korea’s declining PISA scores. Forbes. https://www. forbes.com/sites/michaelhorn/2020/01/09/what-may-lurk-behind-koreas-declining-pisa-scores/ #19c08d412f8d Isac, M. M., Palmerio, L., & van der Werf, M. G. (2019). 
Indicators of (in) tolerance toward immigrants among European youth: An assessment of measurement invariance in ICCS 2016. Large-Scale Assessments in Education, 7(1), 6. Jacob, B. (2005). Accountability, incentives and behavior: Evidence from school reform in Chicago. Journal of Public Economics, 89(5–6), 761–796.


Jacob, B. A. (2007). Test-based accountability and student achievement: An investigation of differential performance on NAEP and state assessments. Working paper no. 12817, National Bureau of Economic Research. https://doi.org/10.3386/w12817. Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. Working paper no. 9413, National Bureau of Economic Research. https://doi.org/10.3386/w9413. Jacobson, L. (2019). Beyond NAEP: Experts seek ways to address US “reading crisis.” Education Dive. https://www.educationdive.com/news/beyond-naep-experts-look-for-ways-to-address-usreading-crisis/567487/ Jennings, J. L., & Lauen, D. L. (2016). Accountability, inequality, and achievement: The effects of the no child left behind act on multiple measures of student learning. RSF, 2(5), 220–241. https:// doi.org/10.7758/RSF.2016.2.5.11 Jiang, F., & McComas, W. F. (2015). The effects of inquiry teaching on student science achievement and attitudes: Evidence from propensity score analysis of PISA data. International Journal of Science Education, 37(3), 554–576. https://doi.org/10.1080/09500693.2014.1000426 Judson, E. (2012). When science counts as much as reading and mathematics: An examination of differing state accountability policies. Education Policy Analysis Archives, 20, 26. Kamens, D. H., & McNeely, C. L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative Education Review, 54(1), 5–25. JSTOR. https:// doi.org/10.1086/648471 Kameshwara, K. K., Sandoval-Hernandez, A., Shields, R., & Dhanda, K. R. (2020). A false promise? Decentralization in education systems across the globe. International Journal of Educational Research, 104, 101669. https://doi.org/10.1016/j.ijer.2020.101669 Kellaghan, T. (2001). The globalisation of assessment in the 20th century. Assessment in Education: Principles, Policy & Practice, 8(1), 87–102. https://doi.org/10.1080/09695940120033270 Koretz, D. (2009). Measuring up: What educational testing really tells us (8/16/09 edition). Harvard University Press. Ladd, H. F., & Lauen, D. L. (2010). Status versus growth: The distributional effects of school accountability policies. Journal of Policy Analysis and Management, 29(3), 426–450. Lauen, D. L., & Gaddis, S. M. (2016). Accountability pressure, academic standards, and educational triage. Educational Evaluation and Policy Analysis, 38(1), 127–147. Lee, J., & Reeves, T. (2012). Revisiting the impact of NCLB high-stakes school accountability, capacity, and resources: State NAEP 1990–2009 reading and math achievement gaps and trends. Educational Evaluation and Policy Analysis, 34(2), 209–231. https://doi.org/10.3102/ 0162373711431604 Lee, J., & Wong, K. K. (2004). The impact of accountability on racial and socioeconomic equity: Considering both school resources and achievement outcomes. American Educational Research Journal, 41(4), 797–832. JSTOR. Leithwood, K., & Earl, L. (2000). Educational accountability effects: An international perspective. Peabody Journal of Education, 75(4), 1–18. https://doi.org/10.1207/S15327930PJE7504_1 Lingard, B., Martino, W., & Rezai-Rashti, G. (2013). Testing regimes, accountabilities and education policy: Commensurate global and national developments. Journal of Education Policy, 28(5), 539–556. https://doi.org/10.1080/02680939.2013.820042 Lockheed, M. E., & Wagemaker, H. (2013). International large-scale assessments: Thermometers, whips or useful policy tools? 
Research in Comparative and International Education, 8(3), 296–306. https://doi.org/10.2304/rcie.2013.8.3.296 Loeb, S., & Byun, E. (2019). Testing, accountability, and school improvement. The Annals of the American Academy of Political and Social Science, 683(1), 94–109. https://doi.org/10.1177/ 0002716219839929 Meyer, H.-D., & Benavot, A. (2013). PISA, power, and policy. Symposium Books. http://www. symposium-books.co.uk/bookdetails/85/ Mizala, A., Romaguera, P., & Urquiola, M. (2007). Socioeconomic status or noise? Tradeoffs in the generation of school quality information. Journal of Development Economics, 84(1), 61–75. https://doi.org/10.1016/j.jdeveco.2006.09.003


Mok, M. M. C. (2007). Quality assurance and school monitoring in Hong Kong. Educational Research for Policy and Practice, 6(3), 187–204. https://doi.org/10.1007/s10671-007-9027-9 Montoya, S. (2018, April 25). A sound investment: The benefits of large-scale learning assessments. UNESCO Institute of Statistics. http://uis.unesco.org/en/blog/sound-investment-benefits-largescale-learning-assessments National Testing of Pupils in Europe: Objectives, Organisation and Use of Results. (2009). Education, Audiovisual and Culture Executive Agency. https://op.europa.eu/en/publicationdetail/-/publication/df628df4-4e5b-4014-adbd-2ed54a274fd9/language-en Nutley, S., Walter, I., & Davies, H. (2007). Using evidence: How research can inform public services. The Policy Press. https://www.press.uchicago.edu/ucp/books/book/distributed/U/bo13441009.html OECD. (2012). Does performance-based pay improve teaching? (PISA in focus). The Organisation for Economic Cooperation and Development. http://www.oecd.org/pisa/pisaproducts/pisainfocus/50328990.pdf OECD. (2013a). PISA 2012 results: What makes schools successful? Resources, policies and practices: Vol. IV. OECD Publishing. https://doi.org/10.1787/9789264201156-8-en OECD. (2013b). School governance, assessments and accountability. In OECD (Ed.), PISA 2012 results: What makes schools successful (volume IV) (pp. 127–164). OECD. https://doi.org/10.1787/9789264201156-8-en OECD. (2018). PISA 2015: Results in focus. Organisation for Economic Cooperation and Development. https://www.oecd.org/pisa/pisa-2015-results-in-focus.pdf Pawson, R. (2006). Evidence-based policy. Sage. https://doi.org/10.4135/9781849209120 Pizmony-Levy, O., & Bjorklund, J. (2014). International assessments of student achievement and public confidence in education: Evidence from a cross-national study. https://doi.org/10.7916/D8HH6XF4 Postlethwaite, N. (1967). School organization and student achievement. New York: Wiley. Postlethwaite, T. N., & Ross, K. N. (1992). Effective schools in reading: Implications for educational planners. An exploratory study. https://eric.ed.gov/?id=ED360614 Powell, W. W., & Snellman, K. (2004). The knowledge economy. Annual Review of Sociology, 30(1), 199–220. https://doi.org/10.1146/annurev.soc.29.010202.100037 Principles of evidence-based policymaking. (2016). Evidence-Based Policymaking Collaborative. https://www.urban.org/sites/default/files/publication/99739/principles_of_evidence-based_policymaking.pdf Rangvid, B. S. (2003). Educational peer effects quantile regression evidence from Denmark with PISA2000 data (p. 41). Institute of Local Government Studies. http://www.oecd.org/denmark/33684822.pdf Rizvi, F. (2009). Globalizing education policy (1st ed.). Routledge. Robertson, S., & Dale, R. (2015). Towards a ‘critical cultural political economy’ account of the globalising of education. Globalisation, 13. https://doi.org/10.1080/14767724.2014.967502 Robitaille, D. F., Garden, R. A., & International Association for the Evaluation of Educational Achievement. (1989). The IEA study of mathematics II: Contexts and outcomes of school mathematics. Pergamon Press. Rosenkvist, M. A. (2010). Using student test results for accountability and improvement: A literature review. No. 54; OECD education working papers, Organisation for Economic Cooperation and Development. https://eric.ed.gov/?id=ED529582 Rothman, B. (2019). Inspection systems: How top-performing nations hold schools accountable [National Center on Education and the Economy].
http://ncee.org/2018/05/how-topperforming-nations-hold-schools-accountable/ Rouse, C. E., Hannaway, J., Goldhaber, D., & Figlio, D. (2007). Feeling the Florida Heat? How Low-Performing Schools Respond to Voucher and Accountability Pressure. Working paper no. 13681, National Bureau of Economic Research. https://doi.org/10.3386/w13681. Rutkowski, L., & Rutkowski, D. (2018). Improving the comparability and local usefulness of international assessments: A look back and a way forward. Scandinavian Journal of Educational Research, 62(3), 354–367.


Rutkowski, L., & Svetina, D. (2017). Measurement invariance in international surveys: Categorical indicators and fit measure performance. Applied Measurement in Education, 30(1), 39–51. Sahlberg, P. (2016). The global educational reform movement and its impact on schooling. In The handbook of global education policy (pp. 128–144). Wiley. https://doi.org/10.1002/ 9781118468005.ch7 Sanderson, I. (2006). Complexity, “practical rationality,” and evidence-based policy making. Policy and Politics, 34, 115–132. Sandoval-Hernandez, A., Rutkowski, D., Matta, T., & Miranda, D. (2019). Back to the drawing board: Can we compare socioeconomic background scales? Revista de Educación, 383, 37–61. Schutz, G., West, M., & Wößmann, L. (2007). School accountability, autonomy, choice, and the equity of student achievement: International evidence from PISA 2003. No. 14; OECD education working papers series, Economic Cooperation and Development. http://www.oecd.org/ education/39839422.pdf Singer, J. D., & Braun, H. I. (2018). Testing international education assessments. Science, 360(6384), 38–40. https://doi.org/10.1126/science.aar4952 Singer, J., Braun, H., & Chudowsky, N. (Eds.). (2018). International education assessments: Cautions, conundrums, and common sense. National Academy of Education. https:// naeducation.org/methods-and-policy-uses-of-international-large-scale-assessments/ Smith, W. C. (2014). The global transformation toward testing for accountability. Education Policy Analysis Archives, 22(0), 116. https://doi.org/10.14507/epaa.v22.1571 Smith, S. S., & Mickelson, R. A. (2016). All that glitters is not gold: School reform in CharlotteMecklenburg. Educational Evaluation and Policy Analysis. https://doi.org/10.3102/ 01623737022002101 Steiner-Khamsi, G. (2010). The politics and economics of comparison. Comparative Education Review, 54(3), 323–342. JSTOR. https://doi.org/10.1086/653047 Teltemann, J., & Jude, N. (2019). Assessments and accountability in secondary education: International trends. Research in Comparative and International Education, 14(2), 249–271. https:// doi.org/10.1177/1745499919846174 Torney-Purta, J., & Amadeo, J. A. (2013). International large-scale assessments: Challenges in reporting and potentials for secondary analysis. Research in Comparative and International Education, 8(3), 248–258. https://doi.org/10.2304/rcie.2013.8.3.248 Verger, A. (2017). Theorising ILSA participation. Compare: A Journal of Comparative and International Education. https://doi.org/10.1080/03057925.2017.1301399 Verger, A., Fontdevila, C., & Zancajo, A. (2016). The privatization of education: A political economy of global education reform. Teachers College Press. https://www.researchgate.net/ publication/305302937_The_Privatization_of_Education_A_Political_Economy_of_Global_ Education_Reform von Davier, M. (2013). In E. Gonzalez, I. Kirsch, & K. Yamamoto (Eds.), The role of international large-scale assessments: Perspectives from technology, economy, and educational research. Springer Netherlands. https://www.springer.com/gp/book/9789400746282 Wagemaker, H. (2020). Reliability and validity of international large-scale assessment: Understanding IEA’s comparative studies of student achievement. Springer. https://library.oapen.org/ handle/20.500.12657/41740 Willis, J., Krausen, K., Byun, E., & Caparas, R. (2018). In the era of the local control funding formula: The shifting role of California’s chief business officers (Getting down to facts II). Policy Analysis for California Education. 
https://www.gettingdowntofacts.com/sites/default/ files/2018-09/GDTFII_Report_Willis.pdf Wiseman, A. W. (2010). The uses of evidence for educational policymaking: Global contexts and international trends. Review of Research in Education. https://doi.org/10.3102/ 0091732X09350472 Wiseman, A. W., & Baker, D. P. (2005). The worldwide explosion of internationalized education policy. In D. P. Baker, & A. W. Wiseman (Eds.), Global trends in educational policy


(International Perspectives on Education and Society, Vol. 6) (pp. 1–21). Emerald Group Publishing Limited, Bingley. https://doi.org/10.1016/S1479-3679(04)06001-3. Wößmann, L. (2001). Why students in some countries do better: International evidence on the importance of education policy. Education Matters, 2(2), 67–74. Wößmann, L. (2003). Schooling resources, educational institutions and student performance: The international evidence. Oxford Bulletin of Economics and Statistics, 65(2), 117–170. https://doi.org/10.1111/1468-0084.00045 Wößmann, L. (2005). The effect heterogeneity of central examinations: Evidence from TIMSS, TIMSS-repeat and PISA. Education Economics, 13(2), 143–169. https://doi.org/10.1080/09645290500031165 Woessmann, L. (2007). International evidence on school competition, autonomy and accountability: A review. Peabody Journal of Education, 82(2–3), 473–497. Yi, P. (2015). Do school accountability and autonomy affect PISA achievement? Evidence from South Korea. Korean Educational Development Institute. https://www.researchgate.net/publication/290457410_Do_school_accountability_and_autonomy_affect_pisa_achievement_Evidence_from_South_Korea

6 International Large-Scale Assessments and Education System Reform: On the Power of Numbers

M. Ehren
Vrije Universiteit Amsterdam, FGB Boechorststraat, Amsterdam, The Netherlands
University College London, Institute of Education, London, UK
e-mail: [email protected]

Contents

Introduction
The Rise of International Large-Scale Assessments
The Power of Numbers
Changing Cognition and Behavior
Changing Behaviour
Penetrating Schools and Classrooms: PISA for Schools
An Alternative Model: The OECD's SEG Framework
Case Studies
Learning Seminars
A Holistic Approach to International Comparisons?
Conclusions
References

Abstract

International large-scale assessments, such as PISA and TIMSS, have had a profound effect on educational policy and school accountability. In an open letter to the OECD in 2014, a large number of academics expressed their concerns over the continuous cycle of global testing and how it negatively affects students' well-being and impoverishes classrooms, as it inevitably involves more and longer batteries of multiple-choice testing, more scripted "vendor"-made lessons, and less autonomy for teachers. Others have pointed to the ways in which international large-scale assessments have opened up education systems and built countries' assessment capacities. This chapter reviews the ways in which international standardized assessments permeate national policy and school-level decision-making, drawing on the sociology of numbers to understand the appeal of standardization and benchmarking. The OECD initiative "PISA for Schools" is described, along with how it potentially extends the influence of international assessments into the classroom. An alternative project within the OECD (the "Strategic Education Governance" project) is presented, showing how its approach of case studies and learning seminars allows countries to learn more holistically about how to hold schools accountable and improve education. Openness, situational awareness, and a consideration of the complexity of successful reform are presented as necessary "habits of mind" for successfully improving learning outcomes. The final conclusion reflects on the limitations and potential uses of international large-scale assessments for effective education reform.

Keywords

Accountability · Commensuration · Standardization · Educational governance

Introduction

International large-scale assessments (ILSAs) are international assessments of academic subjects and other educational indicators that target large and representative samples of students and/or teachers, as well as other stakeholders in education such as school principals or parents. Klieme (2020) describes how these assessments were first implemented around 1960 with the first studies of the International Association for the Evaluation of Educational Achievement (IEA). Today, the most cited studies include the Progress in International Reading Literacy Study (PIRLS, run by the IEA), the Programme for International Student Assessment (PISA, run by the OECD), and the Trends in International Mathematics and Science Study (TIMSS, run by the IEA). These studies measure student outcomes in various subject areas and also include additional background questionnaires on student, teaching, and school conditions. PISA 2015, for example, asks principals about formative testing in the school and practices of school evaluation, while TIMSS has asked questions about students' background, national context, national curriculum, and other school and classroom factors in its various rounds of administration. These background questionnaires aim to enhance their users' understanding of the home, community, school, and student factors associated with student achievement.

Wagemaker (2020) and Hegarty and Finlay (2020) describe how these assessments (including the additional questionnaires) have changed significantly since their introduction in the early 1960s. Not only has there been an increase in the number of countries participating in the various programs, but the governance of these assessments has also changed. Where researchers dominated the development of ILSAs in the early days, over time the policy governance bodies, such as that of the IEA, have seen an increase in members from government or government-related agencies (Pizmony-Levy et al., 2014). According to these authors, this has resulted in a shift from a research-oriented rationale toward an emphasis on policy and on informing educational policy at the international, national, regional, and sometimes even local (school) level.

Such a policy orientation is informed by the goal of ILSAs, summarized by Klieme (2020, p. 147) as "to provide indicators on the effectiveness, equity, and efficiency of educational systems, to set benchmarks for international comparison, and to monitor trends over time." The OECD's Programme for International Student Assessment (PISA) – launched in 1997 – for example, has the objective to "develop regular, reliable, and policy-relevant indicators on student achievement" (https://www.oecd.org/pisa/contacts/howtojoinpisa.htm). The assessment of cross-curriculum competencies of 15-year-olds in literacy, mathematics, and science intends to inform policymakers about the performance of their country's education system by delivering the following:

(a) A set of basic indicators that will provide policymakers with a baseline profile of the knowledge, skills, and competencies of students in their country
(b) A set of contextual indicators that will provide insight into how such skills relate to important demographic, social, economic, and educational variables
(c) Trend indicators that will become available because of the ongoing cyclical nature of the data collections
(d) A knowledge base that will lend itself to further focused policy analysis

Both PISA and TIMSS present the mean scores of countries in each subject domain and provide rankings of countries against one another. The aim is to provide an international benchmark and to assist policymakers in the implementation of effective policies in education, according to Pizmony-Levy et al. (2014).
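A minimal sketch of how such basic indicators and rankings are assembled: country means with a rough standard error, sorted into a league table. The records are invented and the computation is deliberately simplified; actual ILSA reporting relies on plausible values, sampling weights, and replication-based variance estimation.

```python
# Illustrative only: turning invented student records into country means, rough
# standard errors, and a league-table ranking.
import math
from collections import defaultdict

records = [
    ("Country A", 512), ("Country A", 498), ("Country A", 530), ("Country A", 505),
    ("Country B", 476), ("Country B", 491), ("Country B", 469), ("Country B", 488),
    ("Country C", 538), ("Country C", 551), ("Country C", 524), ("Country C", 545),
]

by_country = defaultdict(list)
for country, score in records:
    by_country[country].append(score)

table = []
for country, scores in by_country.items():
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    table.append((country, mean, sd / math.sqrt(n)))   # mean and its standard error

# The "league table": countries ranked by mean score.
for rank, (country, mean, se) in enumerate(sorted(table, key=lambda t: -t[1]), start=1):
    print(f"{rank}. {country}: {mean:.1f} (SE {se:.1f})")
```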


holistic set of indicators to learn intelligently about how to hold schools accountable and improve education. The next section first explains the rise of international assessments and how we can understand their influence and appeal.

The Rise of International Large-Scale Assessments

International large-scale assessments like TIMSS, PIRLS, and PISA have become major driving factors for system-level monitoring and sources of information about assessment, evaluation, and accountability practices in cross-national comparison, according to Klieme (2020). A report by Lockheed and colleagues (2015) shows how, between 1965 and 2015, the number of countries participating in international large-scale assessments has increased by about two-thirds. Since 2003, all OECD member countries have taken part in PISA, with an increase in the number of participating countries/regions from 32 in 2000 to an estimated 88 in PISA 2021 (http://www.erc.ie/studies/pisa/who-takes-part-in-pisa/). According to Lockheed et al. (2015), the increase in participation in large-scale assessments is largely related to long-standing needs for better education indicators to inform policy. As the United States National Research Council observed in the early 1990s,

The lack of an adequate system of education indicators to inform education policy making has become increasingly apparent. (National Research Council, 1993, as cited in Chabbot and Elliott, 2003, p. 4)

Hopfenbeck et al.’s (2017) review of 144 PISA-related English-language peer-reviewed articles from the program’s first cycle in 2000 to 2015 describes the effects of PISA across participating countries. They conclude that PISA is seen by many as having strategic prominence in education policy debates. The rankings of countries on international large-scale assessments have had a profound effect on educational discourse and educational policies, according to Niyozov and Hughes (2019). They argue that these rankings have pushed educational reform and improvement onto the global agenda, as a place at the top of the league table becomes a “badge of honor” for participating governments (see also Pizmony-Levy et al., 2014). This rise is remarkable given the initial and ongoing scepticism about international large-scale assessments and their use for comparative purposes. Addey and Sellar (2019) explain how, in the 1990s, international assessments were questioned for their lack of validity and relevance for policymaking, benchmarking progress, setting standards, and identifying “what works.” More recently, Auld and Morris (2014), Sjøberg (2015), and Rutkowski and Rutkowski (2016) have also expressed concern about how the measurement of, and increased emphasis on, (narrow) academic outcomes supersede national values, curricula, and priorities, ignoring active, inquiry-based processes and content and pressurizing countries into narrowing their curriculum. According to these authors, governments too often ignore the lack of cultural specificity of the measure, the fact that indigenous schools


and special needs students are excluded from the assessments, or potential concerns that have been raised over the security of the data. Despite these concerns, large-scale international assessments gained increasing legitimacy over time as the agreed-upon proxy for “educational quality,” according to Addey and Sellar (2019). The growing confidence in the statistical rigor of large-scale international assessments and the development and expansion of sophisticated technology to test learning skills further enabled their widespread implementation (Verger and Parcerisa, 2017), endorsed and supported by the technical guidance provided by organizations such as the OECD to link national assessments to PISA and use these to inform policymaking in general. Over time, the assumption that comparisons on international assessments are valid became increasingly normalized (Lewis, 2016), leading to a “new paradigm” in which (1) the aims and outcomes of different schooling systems are directly commensurable; (2) system performance on such comparative testing is directly correlative to future economic success; and (3) causal factors are universal and absolute (Auld and Morris, 2014). The widespread use of large-scale international assessments and data for policymaking also reflects wider changes in the governance and accountability of public services in general at the time, according to Addey and Sellar (2019). They describe how the 1970s were marked by a waning trust in professional practice and how a reliance on professionals taking responsibility for learning within schools and schooling systems shifted to competition and comparison as the prevalent methods for improving performance. Both national and international assessments, performance targets, and verification became the key drivers of efficiency and effectiveness in these new modes of governance and essential elements for the development of education and improving performance. Rutkowski (2007) refers to “soft power” to explain how international organizations, such as the OECD, have become highly influential in converging national policy toward an integrated global agenda when fulfilling the role of expert in measuring and evaluating educational policy. As a test grows in worldwide popularity, so does the reputation of the international organization administering it, codifying the assessment’s legitimacy as a standard for national educational evaluation, according to Rutkowski (2007). Ferrer and Fiszbein (2015) further outline the rationales for participating in international large-scale assessments such as PISA and point to a combination of rationalist, normative, and political economy motives. An example of the latter comes from Wiseman (2013) and Grek (2009), who note that countries participate in international assessments when they are aiming for OECD or even European Union membership. Comparing oneself against (other) developed countries is a sign of sharing similar educational values and goals; it promotes one’s own legitimacy and credibility as a developed country and would, in the eyes of these countries, improve the chances of membership. The outcomes of international surveys are also thought to provide valuable information about how to become a high-performing country.
Benchmarking oneself against other countries offers a narrative about how high student outcomes can be explained by (higher) investments in education, as well as by institutional determinants (e.g., accountability, school autonomy) or input factors such as books, teachers, and class size.


The perceived value and credibility of international assessments can even supersede a country’s national assessments and practices. Addey and Sellar (2019) provide an example from Paraguay, where policy actors claimed the public would trust international student assessment data more than national large-scale learning assessments, as the latter could be labeled as “partial.” Verger and Parcerisa (2017) also explain how standardized student assessments have become a profitable industry: companies which specialize in test preparation services as well as in the evaluation and tracking of learning outcomes have an economic interest in the growth of testing and measurement. They are known to lobby for the expansion of standardized student assessment and have redefined how countries define “academic knowledge,” fitting this to the academic tests and subsequent education improvement services they sell. Non-state actors, such as advocacy groups, have equally contributed to this process, for example when they use the outcomes of international assessments to hold governments to account by showing that attainment and equity are issues that require intervention. These (public and private) groups have been highly influential in ensuring standardized testing informs policy, according to Addey and Sellar (2019), particularly when their interests in doing so converge and when they have strong political and economic connections. Verger et al. (2019) talk about a “lock-in effect” to explain the continuity of standardized assessments for macro-policy purposes, even when the effectiveness of intensive use of standardized testing is increasingly questioned by academic evidence. Once these assessments have changed the way we think about “quality” and have infused our decision-making and everyday practice, they are hard to do away with. The “lock-in effect” whereby international large-scale assessments become a routinized and unchallenged part of education systems, however, seems to apply particularly to high-income countries, and not so much to low- and middle-income countries. Addey and Sellar (2019) offer a range of examples of developing countries withdrawing from PISA, such as Botswana, South Africa, Kyrgyzstan, and India, while Mexico and Vietnam are also considering withdrawing. Their motivations are grounded in various political, economic, technical, and sociocultural rationales according to Addey and Sellar (2019), but poor performance and avoiding embarrassment in the international community seem to be among the main reasons. More recent anecdotes also suggest that we might see a return of the 1990s scepticism around international assessments. The 2014 letter from more than 300 academics summarized the increasing critique of the unwarranted influence of the OECD over policy and practice in education and the crude nature of the PISA rankings. Concerns are echoed in Bolivia, where the Minister of Education decided to withdraw from PISA, citing a lack of value in being ranked in relation to other systems. A similar stance has been taken by the European Parliament (initiated by the Polish government) in issuing a statement against decontextualized measures and promoting a more holistic evaluation and understanding of education. The resolution, published in 2016, states that:


. . .standardised tests and quantitative approaches to educational accountability measure at best a narrow range of traditional competences, and may result in schools having to adapt teaching syllabi to test material, thus neglecting the intrinsic values of education; points out that education and training have an important role in developing ethical and civil virtues and humanness, whereas teachers’ work and students’ achievements in this area are overlooked by test scores; highlights in this regard the need for flexibility, innovation and creativity in educational settings which can boost learning quality and educational attainment. (EP resolution number P8_TA(2016)0291: point 33)

However, to date, 80 countries, including EU member states, continue to participate in PISA, while 57 participate in TIMSS. The recent introduction of specific measures for developing countries (PISA for Development) and for schools (PISA for Schools) suggests that the influence of international large-scale assessments, and particularly PISA, is only increasing. The next section offers an explanation.

The Power of Numbers

The appeal of international large-scale assessments and their rankings is a topic that has been studied for a number of years. Various studies on “the sociology of quantification” (e.g., Espeland and Stevens, 1998, 2008; Espeland and Sauder, 2007; Frey et al., 2013) talk about the “power of numbers and rankings” and how these have legitimacy over more qualitative information. One of the foundational works in the field is Desrosières’ (1998) book The Politics of Large Numbers, which questions the relationship between quantification and government and analyzes the processes of production and communication of numbers and numeric data in relation to the political power they unleash (Diaz-Bone and Didier, 2016). Subsequent studies, such as those by Espeland and colleagues, further examined the way societies produce their own categories and nomenclature, how numbers are used and change our understanding of diffuse qualities, and how quantitative information has become an increasingly dominant form of information to coordinate, evaluate, and value. “Numbers” in the context of large-scale assessments are a country’s mean scores in the subject areas that are tested (e.g., science, reading, and mathematics in PISA; mathematics and science in TIMSS) or on variables measured in background questionnaires, with further detailed statistics on, for example, the share of top performers and low achievers (PISA) or the performance of boys versus girls (TIMSS). Most of these numbers are presented in league tables with the best performers at the top and the lowest scoring at the bottom (a minimal illustrative sketch of such a ranking is given at the end of this section). PISA also highlights countries that score below the OECD average in red, suggesting there is a cutoff score or target for countries to meet, even though the use of averages implies that there will always be countries which score below the average. TIMSS includes an international average in its league tables, but without signaling those that score above and below the average. These numbers and rankings are powerful given the ease of using them to compare performance between countries and over time and their ability to reduce messy information into a small set of comparable numbers. They allow relevant


actors to mechanize their decision-making in the face of their own and their organization’s cognitive limitations in collecting and processing more complex qualitative information. Frey et al. (2013) and Sauder and Lancaster (2006) refer to the “illusion of control” when explaining how numbers are embraced for their ease of comparing abstract qualities. Condensing and transforming information into a common, standardized metric eliminates the need for particularistic knowledge in interpreting it (Gregory, 2007; Espeland and Stevens, 2008), and a small set of numbers – such as the ones from international league tables – allows policymakers to more easily decide on how to improve schools and education systems, rather than having to process a large amount of information about uncertain and elusive qualities, such as school climate or quality of instruction (so-called process variables). Numbers in and of themselves are also appreciated for their abstractness; we tend to associate them with rationality and objectivity, and, as Espeland and Stevens (2008) explain, they tend to exude a sense of accuracy and validity as a representation of some quality feature, more so than other types of information (e.g., textual data).
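To make the reduction described above concrete, the following minimal Python sketch builds a league table from a handful of country mean scores and flags those below a computed average. The country names, scores, and the simple unweighted average are illustrative assumptions only; they do not reproduce actual PISA or TIMSS results or the official OECD averaging procedure.

```python
# Illustrative sketch only: hypothetical country means, not actual PISA/TIMSS data.

country_means = {
    "Country A": 523,
    "Country B": 498,
    "Country C": 487,
    "Country D": 512,
    "Country E": 471,
}

# A simple unweighted average stands in for the "OECD average" here;
# the real OECD average is computed over member countries with survey weights.
reference_average = sum(country_means.values()) / len(country_means)

# Rank countries from highest to lowest mean score to form the league table.
league_table = sorted(country_means.items(), key=lambda item: item[1], reverse=True)

for rank, (country, mean) in enumerate(league_table, start=1):
    flag = "below average" if mean < reference_average else ""
    print(f"{rank:>2}. {country:<10} {mean:>5} {flag}")
```

The sketch illustrates the point made above: a single number per country suffices to produce a complete ordering and a below-average flag, while all within-country variation and contextual information disappear from view.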

Changing Cognition and Behavior

Transforming qualities into a common metric is not a value-free operation; the exercise changes both our cognition and the practice we measure. Espeland and Sauder (2007) talk about self-fulfilling prophecies and commensuration in explaining the power of numbers and how these change people’s and organizations’ cognition and respective behaviors. Self-fulfilling prophecies operate by confirming people’s expectations or predictions through a good outcome on these measures, even though the initial measure was false in identifying high-quality performance. Once beliefs are defined as real and confirmed by these measures, they amplify. Over time, numbers and rankings, such as those in international league tables, come to be seen as even more valid, disciplining those which fail to meet the metrics. Sauder and Lancaster’s (2006) well-known example of the introduction of league tables of law schools in the USA provides a compelling illustration; their study shows how league tables were increasingly perceived as a valid measure of the quality of schools and how the increased use of these tables further reinforced their validity and legitimacy, rewarding and punishing those that had successfully and unsuccessfully aligned themselves with these measures. As such, these rankings had the type of “lock-in effect” previously described: when law schools and employers used them to recruit students, prospective students had more reason to use them as a primary source of decision-making, as students’ future careers would benefit from going to a highly ranked law school. In turn, administrators, boards of trustees, and other central administrative staff were forced to consider how their decisions affected these rankings and to align their organization with the metrics underlying the league table, further establishing the importance and legitimacy of these tables. International large-scale assessments and rankings similarly create epistemic communities and networks of policymakers across the globe who count, measure,


and calculate in the same way. Sellar and Lingard (2018) and Sjøberg (2015) talk about how the PISA results have become a global gold standard for educational quality, where the results often set the scene for public debates on the quality of education. The result can be a constrained view of people’s expectations of what a good education system and/or school looks like, where international assessments further reinforce those expectations when a country and/or school aligns its decision-making and practices to the metrics underlying the measure, thereby increasing the legitimacy of the measure. This type of “soft governance” is acknowledged by the international organizations that design these assessments, although they emphasize that the design, development, and implementation of PISA are under the exclusive responsibility of the ministries of education of the participating countries. Lockheed et al.’s (2015) study of countries’ motivations to join PISA indicates that the results of this assessment can be used for several purposes, and Klieme (2020) highlights their benefits for informing critical debates on assessment, evaluation, and accountability. According to Klieme (2020) and Pizmony-Levy et al. (2014), international large-scale assessments have enabled systems to overcome the purely ideological debates around pedagogical practice which have oftentimes dominated the discourse on educational effectiveness and have opened up countries’ educational systems which were previously “hermetically sealed.”

Changing Behaviour

Numbers and rankings not only change how we think about quality and come to understand and express quality on a set of common metrics; they often also modify the phenomenon under study, as sociologists explain. Espeland and Sauder (2007), for example, refer to “reactivity” in explaining how measures tend to change the very thing one is trying to measure and how the distinction between the act of measuring and the object of measurement is becoming increasingly blurred. International organizations, such as the OECD and IEA, actively contribute to such reactive processes by publishing a set of follow-up reports and policy briefs from their various surveys (e.g., PISA in Focus, Teaching in Focus, Education Indicators in Focus; the TIMSS education encyclopedia). These briefings aim to identify features and characteristics of the best performing education systems around the world, facilitating the convergence of education policies toward the metrics in their surveys. Morgan and Volante (2016) draw on the OECD 2013/2014 documentation to summarize the aims of the various surveys:
• The PISA survey allows educational jurisdictions to evaluate education systems worldwide and provides valuable information to participating countries/economies so they are able to “set policy targets against measurable goals achieved by other education systems and learn from policies and practices applied elsewhere.”
• TALIS “sheds light on which [teaching] practices and policies can spur more effective teaching and learning environments.” The OECD claims that TALIS


results enable countries to see more clearly where imbalances might lie and also help teachers, schools, and policymakers learn from these practices at their own level and at other educational levels as well.
• The PIAAC survey assists governments in assessing, monitoring, and analyzing the level and distribution of skills among their adult populations, as well as the utilization of skills in different contexts. The tools that accompany the PIAAC survey are designed to support countries/economies as they develop, implement, and evaluate the development of skills and the optimal use of existing skills.
• AHELO (Assessment of Higher Education Learning Outcomes) offers participating institutions the opportunity to identify the main challenges, achievements, and lessons associated with the examination of higher education learning outcomes and to juxtapose their results against those of their peers.

The reactive effect is particularly well-described for PISA. Carvalho and Costa (2015) talk about how the measure created a “PISA lens.” Examples from Hopfenbeck et al.’s (2017) review of 144 articles indicate a PISA effect on policy and governance, curriculum, accountability structures, and assessments, as well as increasing demands for standardization as a result of PISA. Governments are encouraged to implement data-driven policymaking at the national level and have become increasingly reliant on external authorities such as the OECD for knowledge production and guidance, according to Hopfenbeck et al. (2017). Morgan and Volante (2016) explain how international assessments have also influenced the introduction of national assessments; in contexts where national assessments were not conducted or were of poor quality, participation in international assessments may have acted as a substitute for the lack of technical capacity (Addey and Sellar, 2019) or as a way to build assessment capacity (Lockheed et al., 2015; Klieme, 2020). However, Lockheed et al.’s (2015) study of countries’ motivations to join PISA suggests a reverse relation, indicating that prior experience with other international large-scale assessments or with national large-scale assessments increases a country’s likelihood of participating in PISA. According to these authors, prior experience may have helped to establish sufficient capacity in a country for undertaking large-scale assessments or will have demonstrated the utility of such endeavors and established a culture of assessment. Regardless of the direction of the effect, participation in international large-scale assessments denotes or is accompanied by a wider shift toward output- or test-based accountability, often reinforced by frequent endorsements of such policies by those who operate in the international assessment community, e.g., international organizations developing the assessments, but also the wider research community working with ILSA data. Hegarty and Finlay (2020) describe how the International Association for the Evaluation of Educational Achievement (IEA) has been a major source of scholarship and publication in international large-scale assessment from its early days in the 1960s until the present. By making the datasets on student achievement openly accessible, the organization has allowed scholars worldwide


to study and contribute to educational reform and the enhancement of student learning around the world (Hegarty and Finlay, 2020). Such an effect of international large-scale assessments on education systems and reform is, however, notoriously difficult to demonstrate because education policy agendas are set in, and are the result of, complex processes which are often politically rather than rationally decided (Rutkowski et al., 2020). ILSAs are only one factor among many that influence policymaking, and isolating their effect is near impossible. Verger and Parcerisa (2017) have attempted to understand the potential impact of PISA by analyzing six editions of OECD/PISA reports and find that school autonomy combined with accountability measures was one of the policy recommendations consistently included in these reports. In a study by Breakspear (2012), 29 of 37 OECD country representatives admitted that PISA/OECD recommendations on accountability have influenced accountability reforms at their national level. Such accountability reforms have, according to Lewis (2016), increasingly affected curriculum, pedagogy, and the experiences of students and teachers, particularly when a country’s initial performance on the test is lower than expected, a phenomenon often referred to as “PISA shock.” Germany, for example, experienced such a shock in 2000 with, according to Breakspear (2014) and Gruber (2006), a “tsunami” of political and media responses to what it considered a deplorable position in the league tables. The shock led, according to Gruber (2006), to a host of in-service teacher training, improvement of language teaching in preschool, special support for immigrant children, measures to improve the diagnostic and methodological competence of teachers, and programs to modify school buildings for whole-day use. Waldow (2009), however, also references empirical work by Tillmann and colleagues showing that many of the decisions about introducing standards, assessment, and exams had already been taken before PISA. The disappointing outcome on the international assessments merely provided further political legitimacy to implement these reforms, with education ministries justifying their decisions with reference to PISA. Interestingly, “shocks” from international rankings are particularly attributed to PISA results rather than to other international large-scale assessments, according to Rutkowski et al. (2020). These authors explain how Germany, Norway, and Japan had participated in the TIMSS assessment 5 years prior to PISA and had similar results on TIMSS (in terms of relative rankings), but these resulted in significantly less public discourse and little policy action. They explain the lack of public or political uproar from TIMSS by the potential lack of appetite and/or political muscle of the IEA – responsible for TIMSS – to influence policy debates. Compared to the OECD’s PISA, the IEA leaves most of the discussion of results to the academic community, according to these authors. The reflection highlights how the wider context in which international large-scale assessments are discussed and communicated is important for numbers and league tables to create the type of self-fulfilling prophecy Espeland and Sauder (2007) referred to. PISA for Schools, introduced in 2010, may provide further leverage for aligning school quality to international metrics of educational quality, as the next section argues.


Penetrating Schools and Classrooms: PISA for Schools

The development of PISA for Schools started in 2010, when the OECD commissioned the Australian Council for Educational Research (ACER) to develop test items for a school-level PISA that aligned with existing PISA assessment frameworks for reading, mathematics, and science and which could be evaluated against existing PISA scales (i.e., achievement levels 1–6). PISA for Schools includes an assessment of student performance and a student and principal questionnaire about “in-school” and “out-of-school” influences, such as classroom climate and student attitudes. An initial pilot study in 2012 included schools in the USA and Canada (province of Manitoba), and a second pilot took place in Spain in 2013–2014. After the official launch in the USA in April 2013, PISA for Schools is now available in Andorra, Brazil, Brunei Darussalam, the People’s Republic of China, Colombia, Mexico, the Russian Federation, Spain, the United Arab Emirates, the UK, and the USA (https://www.oecd.org/pisa/pisa-for-schools/). Schools or districts can join PISA for Schools by first expressing their interest to the OECD secretariat, which will then obtain approval to offer the PISA-based Test for Schools from the relevant PISA Governing Board representative. Once the representative has agreed, the school, district, or other public authority wanting to administer the test signs an agreement with the OECD and a contract with an accredited national service provider, who delivers the test within the framework and reporting template set by the OECD. The rapid uptake of PISA for Schools – 12 countries/regions in the 7 years after the pilot – suggests a similar appeal to the main PISA. Motivations for participation are, however, different, given that it is not country-level representatives who decide to join. A study by Lewis (2017) shows how PISA for Schools was welcomed in the USA as an alternative to top-down testing and accountability. Here, local educators perceived the test as a superior alternative to what they saw as flawed regimes in their own country. For these educators, PISA for Schools allowed an exit from their state-level standardized assessments. Similarly, in Portugal, schools and municipal authorities are planning to introduce PISA for Schools as a way to measure competences that are not currently captured in national assessments. PISA for Schools, unlike the main test, measures and compares individual school performance against national and subnational schooling systems. School-level performance data and examples of “best practice” are made available to participating schools via a 160-page report with a set of 17 examples, such as the one below (OECD, 2012, p. 83; taken from Lewis, 2016):

Students from disadvantaged backgrounds who take one hour extra of regular science classes are 1.27 times more likely to be resilient than other disadvantaged students who do not have this opportunity. . . Therefore, introducing compulsory science classes such as physics, biology and chemistry into the core curriculum of disadvantaged students might help close the performance gap with students that come from more advantageous backgrounds.


Any adaptations to the school report template (including translation into languages other than English or the modification of tables or figures) must first be submitted by the accredited national provider to the OECD for approval before being released to participating schools (Lewis, 2016). Such practice promotes the benchmarking of schools against other schools internationally and disseminates a set of best practices at the school level. Schools are, for the first time, able to interact directly with the policies and discourses of the OECD, without the intervening presence of government, allowing the OECD to position itself also as a “local expert” and to intervene directly in more local schooling spaces, rather than focusing on the global and nation-state level only. Such interactions are further promoted via an online platform which is currently being developed. The online platform “Match My School” enables participating schools to connect globally and allows schools that have taken the test to share their results and experiences and to contact one another to exchange information about teaching practices. According to the OECD (2019), the platform promotes peer-to-peer learning and provides a “secure space” to learn about teaching practices (brochure “PISA for Schools”; see also http://www.oecd.org/pisa/pisa-for-schools). Such learning is, however, particularly aimed at improving outcomes on the PISA test, given that the OECD provides all schools with the same set of examples, promoting a school-level borrowing of policies and practices from schools across the globe. In offering these school-level practices, the OECD has extended its reach beyond defining the types of policies and governance models which are best practice in improving national performance levels and into the governance of local schooling. Even though the OECD was quick to point out that it does not tell participating countries or schools how to run their education system or school, the legitimacy and status of the OECD and the dominance of its policy voice arguably restrict the materials, examples, and interpretations educators will access, particularly given the limited time teachers and principals have to look for and interpret best practice. As Lewis (2016) points out, PISA for Schools has the potential to enhance the influence of international organizations and assessments on the classroom by changing relations, specialities, and modes of governance between the OECD, nation states, and participating schools and districts. By having a more direct relationship with schools and districts, the OECD can more easily circumvent government policy in defining what schools should strive toward and how they should attain such goals.

An Alternative Model: The OECD’s SEG Framework

As this chapter has highlighted, international assessments and the more recent development of PISA for Schools have been increasingly criticized for their influence on national policy and schools. (The author of this chapter has advised the SEG project team on multiple occasions.) Breakspear (2014) argues that PISA has limited


countries in their view of what matters educationally, restricting reforms to focus on academic outcomes and evaluating these reforms on the basis of PISA benchmarks. The 2014 open letter to the OECD from 136 academics (https://journals.sagepub.com/doi/pdf/10.2304/pfie.2014.12.7.872) summarized these concerns and how the continuous cycle of global testing negatively affects students’ well-being, reduces the diversity of innovative approaches to education, and essentially promotes system homogeneity. Whether and to what extent this is the case is unclear, but these concerns highlight the need for alternative methods of international comparison which are more context-sensitive and allow countries to learn in a more holistic manner about how to improve their education systems in relation to country-specific priorities. An example of such an approach from within the OECD is the Strategic Education Governance (SEG) project. SEG offers a framework where policymakers can learn about their own system through discussions with stakeholders and against a set of local objectives, instead of via the global comparison of international assessments, which often does not translate well into local contexts. As Claire Shewbridge, the current coordinator of the SEG project team, explains (p. 3) (https://www.oecd.org/education/ceri/SEG-Project-Plan-org-framework.pdf):

(. . .) the Strategic Education Governance project aims to develop tools for policy makers to focus attention on elements of effective governance. An important part of this exploratory work is to develop an organisational framework for strategic education governance, which is normative in nature. The intention is to review this framework, to identify priority areas and to develop within these priority areas a set of actionable and ‘aspirational’ indicators that support strategic education governance.

The SEG project is the successor to the Governing Complex Education Systems (GCES) project, which ran from 2011 to 2016. GCES resulted in two reports with a conceptual framework which identified key elements of effective governance systems and particularly of complex, decentralized systems with multiple actors. The lessons put forward in Burns and Köster (2016) and Burns et al. (2016) are summarized in five points, outlining that effective systems:

1. Focus on effective processes, not on structures.
2. Are flexible as well as adaptive to change and uncertainty.
3. Build capacity, engage in open dialogue, and involve stakeholders.
4. Pursue a whole-of-system approach.
5. Integrate evidence, knowledge, and the use of data to improve policymaking and implementation.

The current SEG project builds on these five points by developing a conceptual framework with a set of policy tools which support countries in developing more effective governance. It takes complexity theory as a starting point in proposing an organizational framework comprising six domains, as depicted in Fig. 1 (taken from: http://www.oecd.org/education/ceri/strategic-education-governance-organisational-framework.htm):


As explained on the website of the OECD (http://www.oecd.org/education/ceri/strategic-education-governance-organisational-framework.htm):
• Strategic thinking: balancing short-term priorities with long-term perspectives in a context in which effective policy strategies emerge and evolve based on new information and system dynamics
• Accountability: organizing who renders an account to whom and for what an account is rendered, shaping incentives and disincentives for behavior
• Capacity: ensuring decision-makers, organizations, and systems have adequate resources and competencies to fulfill their roles and tasks
• Whole-of-system perspective: adopting perspectives reaching beyond individual realms of responsibility to coordinate across decision-makers, governance levels, and policies
• Stakeholder involvement: involving stakeholders throughout the policy process and the practice of governance, in turn building support and increasing the relevance and suitability of policy for stakeholders
• Knowledge governance: stimulating the production of relevant knowledge and promoting its use in decision-making

At present, only a module on “knowledge governance” is operational and offered to countries, which can sign up for a case study-based review with stakeholder reflection workshops and additional learning seminars.

Case Studies

A case study starts with agreeing on the scope of the study within one or more of the modules in the organizational framework in Fig. 1 (currently only knowledge governance) and a mapping of key stakeholders by a country representative and the OECD team. Stakeholders may be internal as well as external to an organization, depending on the agreed scope of the case study. The case studies follow an analytic-empirical approach and include primary and secondary research. The main method of data collection is interviews with organizations’ delegations, conducted by the OECD team on the basis of a questionnaire. The questionnaires aim to investigate specific processes or functional areas within organizations and to facilitate stakeholder engagement in the diagnostic process. They encourage reflection on processes and expand knowledge by inviting information about local practices and approaches. They also interrogate whether particular efforts are carried out within an organization, collect information about how efforts materialize as practices in specific contexts, and identify approaches worth sharing across peer contexts. Depending on the agreed scope, the team will conduct 1–2 weeks of fieldwork, resulting in an analytical report (ca. 60–80 pages). The outcomes of the questionnaires and the report are used in facilitated discussions in a stakeholder reflection seminar (1 day), bringing together key stakeholders to collectively reflect on, refine, and add to the presented analysis.

Fig. 1 An organisational framework for strategic education governance

Engagement of a broader set of stakeholders aims to heighten the collective learning from the case study.

Learning Seminars

A second element of the policy tools offered as part of the framework is learning seminars, hosted by a country. These seminars are small-scale gatherings of countries/systems (including key stakeholders within the participating countries) around one of the modules in the framework. The seminars provide a forum to further develop the policy toolkit while also allowing participants to draw lessons for governance practice. In each seminar, participants focus on a specific policy of the given host system and participate in interactive discussions and facilitated reflection. “SEG learning seminars generate new knowledge of the ‘how to’ of policy implementation and make tacit knowledge explicit by investigating design and implications of implementation strategies” (Meyer and Zahedi 2014). Over the coming years, the OECD SEG team aims to develop similar questionnaires and tools for all six modules. These modules are offered as stand-alone tools, allowing countries to sign up for individual or multiple modules, depending on their capacity, including the available time and resources of the stakeholders who are invited to participate in the case studies. As the OECD explains, the toolkit, when


fully developed, focuses on in-country learning, taking stock of local practices and allowing countries and stakeholders to find out how they are doing against a set of self-defined aspirational items in one or more of the domains in the framework. The approach distinguishes itself from the centralized framework and data collection of PISA by choosing a participatory approach to diagnosing system functioning and areas for improvement. Allowing countries to decide on the priority areas of the review and engaging a multitude of stakeholders in the diagnosis is thought to:
• Promote a common language for stakeholder dialogue across varied contexts.
• Guide stakeholders’ independent reflection and dialogue concerning governance processes.
• Nurture support for improving governance processes through comprehension and engagement.

The ambition of the OECD SEG team is that this will allow participants to learn about effective or innovative policy practices in other countries and about context-specific barriers and enablers for improvement in their own country, and to identify governance options and possible trajectories for future action.

A Holistic Approach to International Comparisons?

To what extent will the chosen approach of case studies and learning seminars enable the OECD SEG team to meet these aims and address some of the criticism of the crude comparisons made in other types of international assessments, such as the OECD’s PISA? One of the key barriers to overcome, as this chapter has highlighted, is the appeal of numbers and comparative benchmarks and their ability to reduce messy information into seemingly easy-to-implement “good practices.” This requires a change in mindset in participating countries, which need to be willing to move away from the reductionist habits and mindsets that underpin country rankings from international large-scale assessments and the easy narrative of comparison on a small set of numbers. Policymakers need to embrace the variability and uncertainty of education system reform, instead of understanding their education system as a collection of separate conditions and policies which have a clear cause-and-effect relationship with learning outcomes. To work with such complexity and learn about improvement, policymakers require a “habit of mind” of openness and situational awareness, in which they balance the risks associated with practicing restraint on the one hand with taking action to move forward on the other (Rogers et al., 2013). By openness, Rogers et al. (2013) mean a willingness to accept, engage with, and internalize the different perspectives and paradigms to be encountered when dealing with diverse participants in an interdisciplinary situation. Openness is particularly relevant where countries engage multiple stakeholders in their policy process to learn about how to improve the governance of their education systems. Views of parents, teachers, or school boards are likely to diverge from those in national government; only when all stakeholders involved in the country case study are able


and willing to navigate these different views and engage with the opinions of others will a country be able to learn about opportunities for improvement and be able to experiment with and cocreate new approaches. Situational awareness is also a key condition for complexity-based thinking; stakeholders need to acknowledge the importance of context and scale, where interventions and good practice can have quite different outcomes across different countries and even regions within countries. These differences include not only spatial and historical contexts but also differences in the value systems of those working within education systems. An awareness of these differences and how they shape schooling practices and learning outcomes is relevant to understanding what good governance in a specific context entails and how to improve it. Such a habit of mind requires policymakers and stakeholders to reflect often, cultivate feedback mechanisms and networks, and have an “anticipatory awareness,” according to Rogers et al. (2013). The habit is particularly relevant for multiple stakeholders sharing their views and understandings of various aspects of their country’s education system’s performance. When interpreting and discussing examples from other countries and deciding on ways to improve their own system, stakeholders need to be willing to adapt examples to their own context. Such situational awareness requires patience and a willingness to allow for the emergence of ideas and opportunities, instead of implementing the type of “ready-to-go solutions” that are often offered in the briefings and good practice examples published from the outcomes of international large-scale assessments referenced in previous sections of this chapter. The patience advocated by Rogers et al. (2013), however, also needs to be balanced with taking action and the ability to weigh appropriate restraint against action. According to these authors, working in complex systems on the one hand requires the creation of space to allow for the emergence of ideas, trust, and opportunity, such as is purposefully staged in the previously described SEG case studies and learning seminars. On the other hand, stakeholders also need to have the courage to take action in a context of uncertainty. As Shewbridge and Köster (2019) explain, in a complex system the consequences of certain actions are never entirely predictable; only when improvement efforts are tried out on a small, local scale before large-scale implementation, accompanied by feedback from a variety of sources and the ability to quickly adapt to it, will systems improve. International large-scale assessments can be one of those sources when used responsibly and taking into account the limitations of the data. Policymakers and stakeholders need to be conscious of and comfortable with this paradigm and approach to be able to succeed.

Conclusions

This chapter outlined the appeal of international assessments and league tables and presented an example of a more holistic approach to reviewing the performance of education systems. In this final conclusion, we turn back to our discussion of


international large-scale assessments and ask how these can and cannot be used to deepen our understanding of the policies and practices that foster educational reform. In the introduction, we briefly referred to the methodological limitations of comparing countries to understand good practice. As Klieme (2020) explains, designs that are fit for studying the effectiveness of specific policies and practices typically require experimental or at least quasi-experimental designs. The cross-sectional nature of data from international large-scale assessments and the lack of cross-cultural comparability do not lend themselves well to an estimation of such effects; cross-sectional data may at best indicate variance in achievement between schools, but such variation does not indicate that higher achievement levels are caused by certain policies or practices that are measured at the same time (Klieme, 2020). Oftentimes, the league tables are used to identify effective policies from the countries scoring at the top of the table. Such comparisons are equally unsuitable for understanding effective reform, given that the positioning of countries in the rank order varies with which countries decide to participate in a particular year. Also, neither PISA nor TIMSS provides a single, universal standard of quality by which schools can be judged, and the variation in student performance on both assessments is mostly within countries, not across them, making a cross-country comparison to identify effective policies inherently flawed. Despite these limitations, international large-scale assessments do have potential for improving our understanding of effective education systems, particularly when analyzing performance within countries. Klieme (2020) points to longitudinal cross-cohort designs to analyze policies that are implemented at the national level, using multiple waves of data collection to look for potential discontinuities in the data related to the introduction of reforms. However, the outcomes of such analyses should also be treated with caution given potential flaws in linking test items over time (Rutkowski and Rutkowski, 2016). Data from international large-scale assessments can, according to Klieme (2020) and Pizmony-Levy et al. (2014), also be analyzed to better understand differential access to and distribution of educational opportunities across students, families, schools, and regions. Such comparisons across jurisdictions of the variances in test scores, of the gradients of test scores on socioeconomic status, or of gaps between immigrant and native-born students can be informative and answer questions such as the following (a minimal illustrative sketch of such a within-country analysis is given at the end of this section):

– Do migrant students and students from socially disadvantaged families have an equal share of well-trained teachers; engaged school principals; well-ordered, supportive, and challenging classroom environments; and out-of-class learning opportunities?
– Who receives differentiated instruction, supportive feedback, and direct guidance from his or her teachers?
– Which schools report policies for assessment and evaluation, and which don’t?
– Do student truancy and attention in the classroom differ between subpopulations?

International large-scale assessments can deepen understanding of the policies and practices that foster educational reform, but only when policymakers are


willing to look beyond the simplistic headlines that often run in the media after publication of the rankings and when they actively engage with the evidence to understand the type of claims the assessment and resulting data can support (Rutkowski et al., 2020). When analyzing PISA scores for policy-making purposes, we need, for example, to understand that PISA does not measure students’ mastery of national curricula and that low outcomes can be explained by a mismatch between what is taught in schools and the skills measured on the test. Improving education systems requires deep in-country learning and the agency of the multiple stakeholders working in education systems to improve the governance and outcomes of their system. International large-scale assessments are, in the end, only a set of questionnaires which allow for diagnosis and review but will not lead to actual change without active engagement. Changing mindsets to deal with complexity and allow for stakeholder involvement and voice is where the real challenge lies for effective education reform.
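As a concrete illustration of the within-country analyses referred to above, the sketch below decomposes test-score variance into between- and within-country shares and estimates a simple socioeconomic gradient. It uses simulated data and plain Python; the country labels, effect sizes, and noise levels are assumptions for illustration only, and the sketch deliberately ignores survey weights and plausible values, so it is not the official ILSA scaling or analysis methodology.

```python
# Illustrative sketch only: simulated scores, no survey weights or plausible
# values, so this does not reproduce the official ILSA analysis methodology.
import random
import statistics

random.seed(0)

countries = ["A", "B", "C", "D"]
true_means = {"A": 510, "B": 495, "C": 480, "D": 500}  # hypothetical country means

records = []  # (country, ses, score)
for c in countries:
    for _ in range(500):
        ses = random.gauss(0, 1)  # standardized socioeconomic status index
        score = true_means[c] + 25 * ses + random.gauss(0, 80)
        records.append((c, ses, score))

scores = [score for _, _, score in records]
grand_mean = statistics.fmean(scores)

# Between-country variance: spread of country means around the grand mean.
country_means = {
    c: statistics.fmean(s for cc, _, s in records if cc == c) for c in countries
}
between = statistics.fmean((m - grand_mean) ** 2 for m in country_means.values())

# Within-country variance: average squared deviation from one's own country mean.
within = statistics.fmean((s - country_means[c]) ** 2 for c, _, s in records)

print(f"Share of variance between countries: {between / (between + within):.1%}")
print(f"Share of variance within countries:  {within / (between + within):.1%}")

# Pooled socioeconomic gradient: slope of score on SES across all students.
ses_vals = [ses for _, ses, _ in records]
ses_mean = statistics.fmean(ses_vals)
cov = statistics.fmean(
    (x - ses_mean) * (y - grand_mean) for x, y in zip(ses_vals, scores)
)
slope = cov / statistics.pvariance(ses_vals)
print(f"SES gradient: {slope:.1f} score points per standard deviation of SES")
```

Run with these assumed parameters, the between-country share comes out at only a few percent, echoing the point above that most of the variation in student performance lies within, not across, countries, while the gradient approximately recovers the assumed 25-point SES effect.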

References

Addey, C., & Sellar, S. (2019). Rationales for (non)participation in international large-scale learning assessments. UNESCO Working Paper ED-2019/WP/2.
Auld, E., & Morris, P. (2014). Comparative education, the ‘New Paradigm’ and policy borrowing: Constructing knowledge for educational reform. Comparative Education, 50(2), 129–155. https://doi.org/10.1080/03050068.2013.826497
Breakspear, S. (2014, November). How does PISA shape education policy making? Why how we measure learning determines what counts in education. Centre for Strategic Education Seminar Series Paper (Vol. 40).
Burns, T., & Köster, F. (2016). Governing education in a complex world. Educational Research and Innovation. OECD Publishing.
Burns, T., Köster, F., & Fuster, M. (2016). Education governance in action: Lessons from case studies. Educational Research and Innovation. OECD Publishing.
Carvalho, L. M., & Costa, E. (2015). Seeing education with one’s own eyes and through PISA lenses: Considerations of the reception of PISA in European countries. Discourse: Studies in the Cultural Politics of Education, 36(5), 638–646. https://doi.org/10.1080/01596306.2013.871449
Chabbot, C., & Elliott, E. (Eds.). (2003). Understanding others, educating ourselves: Getting more from international comparative studies in education. National Academies Press.
Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Diaz-Bone, R., & Didier, E. (2016). The sociology of quantification – Perspectives on an emerging field in the social sciences. Historical Social Research, 41(2), 7–26. https://doi.org/10.12759/hsr.41.2016.2.7-26
Espeland, W., & Sauder, M. (2007). Rankings and reactivity: How public measures recreate social worlds. American Journal of Sociology, 113(1), 1–40. https://doi.org/10.1086/517897
Espeland, W., & Stevens, M. L. (1998). Commensuration as a social process. Annual Review of Sociology, 24, 313–343.
Espeland, W. N., & Stevens, M. L. (2008). A sociology of quantification. European Journal of Sociology/Archives Européennes de Sociologie, 49(3), 401–436. https://doi.org/10.1017/S0003975609000150


Ferrer, G., & Fiszbein, A. (2015). What has happened with learning assessment systems in Latin America? Lessons from the last decade of experience. The World Bank.
Frey, B. S., Homberg, F., & Osterloh, M. (2013). Organizational control systems and pay-for-performance in the public service. Organization Studies, 34(7), 949–972. https://doi.org/10.1177/0170840613483655
Gregory, A. J. (2007). Target setting, lean systems and viable systems: A systems perspective on control and performance measurement. Journal of the Operational Research Society, 58(11), 1503–1517. https://doi.org/10.1057/palgrave.jors.2602319
Grek, S. (2009). Governing by numbers: The PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23–37. https://doi.org/10.1080/02680930802412669
Gruber, K. H. (2006, May). The German ‘PISA-Shock’: Some aspects of the extraordinary impact of the OECD’s PISA study on the German education system. In Cross-national attraction in education: Accounts from England and Germany. Symposium Books.
Hegarty, S., & Finlay, S. (2020). Publications and dissemination. In Reliability and validity of international large-scale assessment (pp. 221–230). Springer.
Hopfenbeck, T., Lenkeit, J., El Masri, Y., Cantrell, K., Ryan, J., & Baird, J. (2017). Lessons learned from PISA: A systematic review of peer-reviewed articles on the Programme for International Student Assessment. Scandinavian Journal of Educational Research. https://doi.org/10.1080/00313831.2016.1258726
Klieme, E. (2020). Policies and practices of assessment: A showcase for the use (and misuse) of international large-scale assessments in educational effectiveness research. In International perspectives in educational effectiveness research (pp. 147–181). Springer.
Lewis, S. (2016). Governing schooling through ‘what works’: The OECD’s PISA for Schools. Journal of Education Policy. https://doi.org/10.1080/02680939.2016.1252855
Lewis, S. (2017). Communities of practice and PISA for Schools: Comparative learning or a mode of educational governance? Education Policy Analysis Archives, 25(92). https://doi.org/10.14507/epaa.25.2901
Lockheed, M., Prokic-Bruer, T., & Shadrova, A. (2015). The experience of middle-income countries participating in PISA 2000–2015 (PISA series). The World Bank & OECD Publishing. https://doi.org/10.1787/9789264246195-en
Meyer, H. D., & Zahedi, K. (2014). An open letter: To Andreas Schleicher, OECD, Paris. Mitteilungen der Gesellschaft für Didaktik der Mathematik, 40(97), 31–36.
Morgan, C., & Volante, L. (2016). A review of the Organisation for Economic Cooperation and Development’s international education surveys: Governance, human capital discourses, and policy debates. Policy Futures in Education, 14(6), 775–792. https://doi.org/10.1177/1478210316652024
Niyozov, & Hughes. (2019). http://theconversation.com/problems-with-pisa-why-canadians-should-be-skeptical-of-the-global-test-118096
OECD. (2012). How your school compares internationally: OECD test for schools (based on PISA) pilot trial [US version]. OECD. http://www.oecd.org/pisa/aboutpisa/pisa-based-test-for-schools-assessment.htm
Pizmony-Levy, O., Harvey, J., Schmidt, W. H., Noonan, R., Engel, L., Feuer, M. J., & Chatterji, M. (2014). On the merits of, and myths about, international assessments. Quality Assurance in Education. https://doi.org/10.1108/QAE-07-2014-0035
Rogers, K. H., Luton, R., Biggs, H., Biggs, R., Blignaut, S., Choles, A. G., Palmer, C. G., & Tangwe, P. (2013). Fostering complexity thinking in action research for change in social–ecological systems. Ecology and Society, 18(2), 31. https://doi.org/10.5751/ES-05330-180231
Rutkowski, D. J. (2007). Converging us softly: How intergovernmental organizations promote neoliberal educational policy. Critical Studies in Education, 48(2), 229–247. https://doi.org/10.1080/17508480701494259
Rutkowski, L., & Rutkowski, D. (2016). A call for a more measured approach to reporting and interpreting PISA results. Educational Researcher, 45(4), 252–257. https://doi.org/10.3102/0013189X16649961

118

M. Ehren

Rutkowski, D., Thompson, G., & Rutkowski, L. (2020). Understanding the policy influence of international large-scale assessments in education. In Reliability and validity of international large-scale assessment, 261. https://doi.org/10.1007/978-3-030-53081-5_15 Sauder, M., & Lancaster, R. (2006). Do rankings matter? The effects of US News & World Report rankings on the admissions process of law schools. Law & Society Review, 40(1), 105–134. https://doi.org/10.1111/j.1540-5893.2006.00261.x Sellar, S., & Lingard, B. (2018). International large-scale assessments, affective worlds and policy impacts in education. International Journal of Qualitative Studies in Education, 31(5), 367–381 Shewbridge, C., & Köster, F. (2019). Strategic education governance; policy toolkit, design. OECD. Sjøberg, S. (2015). PISA and global educational governance-a critique of the project, its uses and implications. Eurasia Journal of Mathematics, Science & Technology Education, 11(1). https:// doi.org/10.12973/eurasia.2015.1310a Van Petegem, P., & Vanhoof, J. (2004). Feedback over schoolprestatie-indicatoren als strategisch instrument voor schoolontwikkeling? Lessen uit twee Vlaamse cases [Feedback of performance indicators as a strategic instrument for school improvement? Lessons from two Flemish cases]. Pedagogische Studiën, 81(5), 338–353. Verger, A., & Parcerisa, L. (2017). Accountability and education in the post-2015 scenario: International trends, enactment dynamics and socio-educational effects. Paper commissioned for the 2017/8 Global Education Monitoring Report, Accountability in education: Meeting our commitments. Paris: UNESCO. ED/GEMR/MRT/2017/P1/1/REV. Verger, A., Fontdevila, C., & Parcerisa, L. (2019). Reforming governance through policy instruments: how and to what extent standards, tests and accountability in education spread worldwide. Discourse: Studies in the Cultural Politics of Education. https://doi.org/10.1080/ 01596306.2019.1569882 Wagemaker, H. (2020). Introduction to reliability and validity of international large-scale assessment. In Reliability and validity of international large-scale assessment (pp. 1–5). Springer publishers. Waldow, F. (2009). What PISA did and did not do: Germany after the ‘PISA-shock’. European Educational Research Journal, 8(3), 476–483. https://doi.org/10.2304/eerj.2009.8.3.476 Wiseman, A. (2013). Policy responses to PISA in comparative perspective. In H. D. Meyer & A. Benavot (Eds.), PISA, power, and policy. The emergence of global educational governance. Symposium Books.

7 The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries

Syeda Kashfee Ahmed, Michelle Belisle, Elizabeth Cassity, Tim Friedman, Petra Lietz, and Jeaniene Spink

Contents
Chapter Summary
Background
Evidence-Based Policy-Making
Factors Influencing the Use of Evidence in Policy-Making
A Special Case of ILSAs: Two Regional Large-Scale Assessment Programs
Pacific Islands Literacy and Numeracy Assessment (PILNA)
A Brief Overview of PILNA
How Is PILNA Integrated into the Policy-Making Process of Participating Countries?
What Is the Role of Capacity Building Activities and Technical Quality in PILNA?
What Does PILNA's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?
Southeast Asia Primary Learning Metrics (SEA-PLM)
A Brief Overview of SEA-PLM
What Is the Role of Capacity Building Activities and Technical Quality in SEA-PLM?
How Is SEA-PLM Integrated into the Policy-Making Process of Participating Countries?
What Does SEA-PLM's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?
Conclusion
References

S. K. Ahmed (*) · M. Belisle · E. Cassity · T. Friedman · P. Lietz · J. Spink
EMR, ACER, Adelaide, Australia
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_7


Abstract

Increasingly, economically developing countries are using large-scale assessments (LSAs) to monitor progress toward the sustainable development goals for the 2030 education agenda. In this chapter, two regional assessment programs, namely the Pacific Islands Literacy and Numeracy Assessment (PILNA) and the Southeast Asia Primary Learning Metrics (SEA-PLM), are discussed in terms of the key factors that contribute toward the uptake of evidence from LSAs in the education policy-making cycle. The objective is to understand the applicability and usefulness of LSAs in these contexts.

Keywords

Evidence-based policy making (EBP) · International large-scale assessments (ILSAs) in economically developing countries · Policy-making cycle in education

Chapter Summary

The number of countries undertaking international and regional large-scale assessments (LSAs) is increasing. Much of this growth has occurred in economically developing countries, with a view not only to monitoring progress toward the sustainable development goals for the 2030 education agenda (UNESCO, 2015) but also to providing information for evidence-based policy and decision-making about education inputs and processes, aimed at the continuous improvement of learning outcomes (OECD, 2019b). This chapter explores the role of international large-scale assessments (ILSAs) in low- and middle-income countries (LMICs). It starts with a brief introduction to the concept of evidence-based policy-making in general, in economically developing countries, and in education. It then looks at research into factors that have been found to influence the uptake of evidence from large-scale assessments in the policy-making cycle in education. These factors are then used to discuss two regional assessments, namely, the Pacific Islands Literacy and Numeracy Assessment (PILNA) and the Southeast Asia Primary Learning Metrics (SEA-PLM). The conclusion provides considerations regarding the usefulness of the identified factors and reflections on current and future challenges of PILNA and SEA-PLM for evidence-based policy-making (EBP).

Background

Evidence-Based Policy-Making



The idea of evidence-based policy-making (EBP) can be traced through the centuries, from the mid-fourteenth century, when the use of evidence was sought to give more structure to the erratic decision-making behavior of despots, to Florence Nightingale, who criticized the British Government in the 1920s for not taking into consideration evidence from the past and the present when changing policy (Banks, 2010). More recently, EBP has been defined as "a discourse or set of methods which informs the processes by which policies are formulated, rather than aiming to affect the eventual goals of the policy" (Sutcliffe & Court, 2005, p. 1). This focus on process rather than outcomes is reflected by Young and Quinn (2002), who divided the policy process into four parts, namely, agenda setting, policy formulation, policy implementation and monitoring, and policy evaluation. Sutcliffe and Court (2005) have further analyzed these stages and identified some key issues regarding the use of evidence at the various policy stages (see Table 1).

Table 1 Key issues regarding the use of evidence in the policy-making process

Agenda setting: Evidence is useful for identifying new problems or to understand the extent of a problem and to create awareness among the relevant policy actors. The credibility of evidence and the way it is delivered is critical here.

Policy formulation: Evidence will help policymakers understand the scope and the options to be able to make informed decisions about which policy to go ahead and implement. The size and reliability of the evidence is important at this stage.

Policy implementation: Operational evidence is needed to support sound implementation and the effectiveness of initiatives. Systematic learnings around technical skills, expert knowledge, and practical experience are all valuable for this stage.

Monitoring and policy evaluation: Evidence is essential to develop effective monitoring tools; the evidence therefore needs to be objective, detailed, and appropriate, and collected regularly throughout the policy process.

Source: Adapted from Sutcliffe and Court (2005)

In the field of education, EBP is often used to support new reforms and system-level changes made by educational policymakers. Examples of this can be seen across many of the high-performing education systems. For instance, in South Korea, national system-wide assessment data are frequently used for policy initiatives to remain academically competitive. Korean students whose annual National Assessment of Educational Achievement (NAEA) scores fall below the "Below Basic" level are regularly identified, and the Ministry of Education has implemented a policy of supporting low-achieving students by providing financial and administrative support for schools with Below Basic-level students (Kim et al., 2016).

Still, evidence-based policy-making is often considered to be more critical for LMICs than for high-income countries, as the effective use of evidence in policy and practice can potentially reduce poverty, improve health, education, and quality of living, and support social development and economic growth in these countries (Sutcliffe & Court, 2005). In addition, some LMICs use EBP to support their education sector goals.


A noteworthy example is the use of international assessment evidence by Chile since the 1960s, starting with the country's participation in IEA's Six Subject Survey and subsequent ILSAs. Information from these ILSAs has influenced Chile's education policy by providing key data for decision-makers for different purposes such as curricular reforms, educational system-level planning, and strategy implementation, as well as national assessment reform (Cox & Meckes, 2016).

One of the main rationales for participation in ILSAs is that educational policymakers look for strategies that have worked well elsewhere and appear to be the best policy for their particular conditions (Addey et al., 2017). Other key reasons to participate in ILSAs include a desire by policymakers to follow the norm as members of international communities with shared values (Addey et al., 2017) and the use of global rankings for educational policy agenda setting and policy target monitoring (Addey et al., 2017; Fischman et al., 2017). In fact, many education systems participate in ILSAs for the status it brings and the opportunity to benchmark against the high-performing group of affluent OECD countries (Kamens, 2013; Addey et al., 2017).

In addition, much of the growth in international large-scale assessments (ILSAs) in recent years has occurred in economically developing countries (Wagner et al., 2012). While these countries frequently use ILSAs to monitor progress toward the sustainable development goals for the 2030 education agenda (UNESCO, 2015), they are also interested in gathering information for evidence-based policy and decision-making about education inputs and processes for the continuous improvement of learning outcomes (Lockheed, 2012; Wagner et al., 2012). Moreover, development partners, which are also committed to the concept of EBP (e.g., DFAT, 2015; DFID, 2013; Jones, 2012; USAID, 2019), expect robust evidence particularly applicable to economically developing countries, yet systemic issues with data collection frequently pose a challenge to this approach. Thus, it is not surprising that, according to the 2018 UNESCO Institute for Statistics (UIS) report Learning divides: Using data to inform educational policy, a large number of LMICs are yet to participate in an international assessment, although the results of such comparative ILSAs are highly likely to influence a country's political will to increase education-sector budgets (Singer & Braun, 2018; Willms, 2018).

In this context, it is important to note that current ILSAs are relatively limited in terms of the extent to which they collect information that is relevant to policymakers in those countries (Cresswell et al., 2015). First, the cognitive tests of the subject areas are not well targeted to assess and describe the whole range of student ability in those countries in a meaningful way. Second, the resulting lower variability in outcome measures limits the possibility of exploring factors – for which information is usually obtained through contextual questionnaires administered to students, parents, teachers, and principals – aimed at explaining differences in student ability (Lockheed & Wagemaker, 2013). Moreover, while some customization of contextual instruments can occur in international large-scale assessments, it is often minimal.


This is because questionnaire content is determined by administration time constraints, the need to keep questions stable to enable comparisons over time, and policy ranking exercises in which questions that are relevant to only a smaller number of countries with smaller populations frequently do not make the cut-off for inclusion.

Still, evidence from LMICs suggests that data from large-scale assessments of students' learning are frequently used to inform system-level policies regarding curriculum standards and reforms, performance standards, and assessment policies, all aimed at improving student outcomes, reducing inequalities, and strengthening foundations (Willms, 2018). Thus, for example, Iran used the Trends in International Mathematics and Science Study (TIMSS) curricular framework for its own assessment (Heyneman & Lee, 2014), while Kyrgyzstan's participation in the Programme for International Student Assessment (PISA) informed the government's ongoing curricular reforms in 2009 and 2010 (Shamatov & Sainazarov, 2010).

However, evidence for the use of large-scale assessment data in teaching and learning policies, aimed at specific school- and classroom-level practices, is still emerging (Best et al., 2013). One such use of assessment data for improving policies for in-class learning strategies has been found in Malaysia, where the number of in-class science experiments and computer uses was increased as a result of the findings from TIMSS 2003 (Gilmore, 2005). Other countries, such as Mexico, have used data from PISA 2015 to develop "school resource plots" covering areas such as inclusion, quality of instruction, learning time, material resources, and family support, which are valuable for planning policies about the allocation of resources (Willms, 2018).

Factors Influencing the Use of Evidence in Policy-Making

It is noteworthy that, while there is some indication of large-scale assessments informing education policy (e.g., Breakspear, 2012; Gilmore, 2005; Lietz et al., 2016; Tobin et al., 2015), the link would seem smaller than might be expected given the cost and effort of these large-scale assessments. To explore reasons for this smaller than expected link, a number of studies have examined facilitators for and barriers to the uptake of evidence in policy-making (Banks, 2010; Buckley et al., 2014; Cherney et al., 2012; Oliver et al., 2014; Orem et al., 2012). Across a number of fields – although with a strong focus on health – these studies found that the quality of evidence, its timeliness, and its relevance were factors influencing the use of evidence in policy-making. In addition, collaboration and partnership, as well as factoring learning events into the process, were found to assist with the uptake of evidence in policy-making.

In the field of education, a review by Best et al. (2013), which focused on the impact of evidence from ILSAs on policy, identified as facilitators the reliability of the assessment program, its integration with local policies, media and public opinion, and the appropriate and accurate dissemination of results to different stakeholders and audiences.


Fig. 1 Factors related to the impact of ILSAs on education policy: integration into policy processes, capacity building activities, technical quality, and access to and dissemination of results to the public

At the same time, low quality of an assessment program and poor quality data, which made it difficult to undertake further, in-depth analyses, emerged as barriers. In LMICs, capacity development to assist with the quality of the assessment was particularly crucial to the confidence in the evidence and the resulting uptake in policy-making.

Taken together, for LMICs, these findings suggest a greater impact of evidence on education policy where ILSAs (Fig. 1):

(a) Are better integrated into the policy-making process of participating countries
(b) Include capacity development, particularly with a view to improving the assessments' technical quality
(c) Are accompanied by a sound dissemination strategy engaging all relevant stakeholders, including the media

In the next section, two current regional assessment programs as a special case of ILSAs, namely, the Pacific Islands Literacy and Numeracy Assessment (PILNA) and the Southeast Asia Primary Learning Metrics (SEA-PLM), are discussed in terms of these factors to explore their applicability and usefulness.

A Special Case of ILSAs: Two Regional Large-Scale Assessment Programs


In this section, two special cases of ILSAs in the form of regional assessment programs, which involve many economically developing countries, namely, the Pacific Islands Literacy and Numeracy Assessment (PILNA) and the Southeast Asia Primary Learning Metrics (SEA-PLM), are discussed. For each assessment program, a brief overview is followed by a discussion in terms of the three factors that have been found to influence the use of evidence in education policy-making. This discussion then informs the conclusion, which looks at the usefulness of considering ILSAs in terms of the identified factors.

Pacific Islands Literacy and Numeracy Assessment (PILNA)

A Brief Overview of PILNA

The Pacific Islands region is one of the largest and most diverse regions globally. The region is home to some 9.7 million inhabitants, with populations ranging from nearly 7.3 million residents in Papua New Guinea to approximately 1,700 residents in Niue (Pacific Community, 2019; UNESCO, 2018). There is a high level of diversity in the region's geography, history, languages, culture, economies, and political systems (UNESCO, 2018). Despite these differences, many Pacific countries have common education challenges and a shared goal of improving educational achievement in literacy and numeracy (Pacific Community, 2019). A number of Pacific countries have a high youth population, with around 50% under the age of 25 (UNESCO, 2018). Pacific Island stakeholders, therefore, understand the importance of literacy and numeracy in developing the foundation skills necessary to participate in all aspects of everyday life (Pacific Community, 2019).

The Pacific Islands Literacy and Numeracy Assessment (PILNA) was developed by the Educational Quality and Assessment Programme (EQAP) of the Pacific Community and first administered in 2012 in 14 Pacific Island countries. Based on those findings, the 2014 Forum Education Ministers Meeting (FEdMM) requested a 2015 administration of PILNA and supported the development of a long-term regional assessment program.

PILNA is a measurement of regional standards based on a common scale that provides data on the literacy and numeracy skills of students who have completed 4 and 6 years of formal primary education, respectively. It is a regional collaborative model that is highly consensual among the participating countries, providing shared intellectual capital and value for money. PILNA also reports on student, teacher, and school background. In 2018, 15 Pacific Island countries, 40,000 students, and 925 schools participated in the third administration of PILNA (the 15 countries that participated in PILNA 2018 are: Cook Islands, Federated States of Micronesia, Fiji, Kiribati, Marshall Islands, Nauru, Niue, Palau, Papua New Guinea, Samoa, Solomon Islands, Tokelau, Tonga, Tuvalu, and Vanuatu). The fourth administration of PILNA takes place in 2021.


The overarching purpose of PILNA as a long-term, Pacific-wide regional assessment is to generate cognitive and contextual data that can be used to facilitate ongoing collaborative efforts to monitor and improve learning outcomes for children in Pacific Island countries. The PILNA program represents a commitment by Pacific Island governments and development partners to monitor the outcomes of education systems by measuring student achievement on a regular basis and within an agreed common framework. It addresses targets identified in SDG 4 by providing evidence of education quality for governments, schools, and communities in the region. By building capacity through the collaborative involvement of country representatives, the PILNA program helps to strengthen learning assessments, standards, and policies, while also supporting improvement in teaching and learning across the Pacific region.

How Is PILNA Integrated into the Policy-Making Process of Participating Countries?

PILNA has developed as a country-owned, collaborative program since its first administration in 2012. This high level of country and regional ownership supports the growing integration of PILNA into policy processes. From 2015 onward, countries participating in PILNA have been bound by the following two conditions:

1. That participating countries commit to sharing their results with other countries so that lessons can be learned, especially from those that appear to be doing better, about good practices and policies that have been demonstrated to work.
2. That each country commits to using the findings to carry out policy interventions as well as technical interventions – for example, classroom instructional interventions to improve learning outcomes – aimed at improving the situation in each country.

In 2015, a steering group consisting of the heads of education systems from all 15 Pacific Island Forum countries was formed as the governing body of PILNA. The PILNA Steering Committee provides governance for the entire administration of each 3-year cycle of PILNA. The Steering Committee consists of the Chief Executive Officer (CEO) of each participating country's ministry of education, the Australian Council for Educational Research (ACER) as EQAP's technical partner, representatives of the New Zealand Ministry of Foreign Affairs and Trade (MFAT) and the Australian Department of Foreign Affairs and Trade (DFAT), and the director of EQAP. Members of the Steering Committee are able to formally represent the strategic priorities of the participating countries and engage in high-level discussions on behalf of their ministries. Moreover, the Steering Committee is able to make critical decisions about PILNA as a result of the support that it has at the highest levels of government. This process has resulted from the ownership taken by the respective CEOs and is a step toward integrating PILNA into countries' policy-making agendas.


PILNA is endorsed by the Pacific Forum Education Ministers' Meeting (FEdMM) and is an integral part of the regional education architecture. Quality and relevance of education are important considerations for the education sector in the region. PILNA provides evidence to support countries in several stages of the policy cycle. The design and development of PILNA since 2015 have improved the quality and detail of the data available to countries for policy- and decision-making purposes.

As a technical improvement, contextual questionnaires were first piloted as part of the PILNA 2015 cycle to serve two important regional aims. The first aim was to provide contextual information about students and their families as well as about their classrooms and schools to help explain differences observed in literacy and numeracy achievement. The second aim was to collect information which – in addition to cognitive outcomes in the subject areas – can be considered to represent outcomes in their own right, such as, for example, students' general attitudes toward learning or toward specific subject areas. The pilot was a success, provided valuable information about the contexts for education in the region, and has informed further development work on main study items.

In preparation for the 2018 cycle, the policies of countries participating in PILNA were mapped to identify content of common interest for consideration for inclusion. Policies related to assessment, curriculum, and strategic plans (to identify priority areas of the country) were considered as part of the exercise. Further consultation with country representatives led to the identification of common priorities of interest, including:

• Early learning opportunities
• Attendance at an Early Childhood Education and Care program (e.g., pre-school, kindergarten)
• Language
• Vernacular and English
• Quality of instruction
• Pedagogical and Assessment practice
• Professional Development
• Teacher certification and retention
• Family and community support

A similar exercise was conducted in preparation for the 2021 cycle, with content related to the well-being of the school community as well as teacher and school leader satisfaction. Additionally, the example of policy mapping described above outlines one of the consultative approaches to supporting countries in integrating PILNA into their policy-making processes. As PILNA is a regional assessment, it was critical for countries to identify common priorities of interest for the development of the questionnaires.

During its first meeting in 2015, the PILNA Steering Committee adopted a consensual approach to determining the purposes and use of data from PILNA.


Using a "think-pair-share" format (think-pair-share is a technique that enables workshop participants to collaborate in developing ideas about an issue), each member of the Steering Committee identified purposes for the use of PILNA data. Six key purposes emerged from the activity and subsequent discussion; these are presented in Table 2 below. These purposes were reaffirmed by the Committee in 2017. The 2018 PILNA reporting cycle and a 2019 PILNA data exploration workshop (described in the following section) have demonstrated the operationalization of the purposes and use of PILNA data as first conceptualized by the PILNA Steering Committee in 2015.

What Is the Role of Capacity Building Activities and Technical Quality in PILNA?

Capacity development has been an important part of PILNA since 2015. A key feature of this is a long-term technical partnership between EQAP and ACER supported by funding from the Governments of Australia and New Zealand. The quality of the assessment program has been enhanced as EQAP and ACER work closely on all technical aspects of PILNA. The other key feature of capacity development has been support and research in using PILNA data to inform education policy and programming in Pacific countries. Both technical support and workshops in using data to inform policy agendas have strengthened the impact and technical quality of PILNA through working collaboratively with governments. This collaborative approach was described previously. The following two sections provide examples of capacity development and the ongoing improvement of the technical quality of PILNA.

Research Capacity and the Use of Data

Student, teacher, and head teacher questionnaires support Pacific governments in exploring contextual factors which have been shown to be linked to literacy and numeracy results. Knowing that achievement in one of the cognitive domains is weaker than expected at the country and/or subgroup level is important, but without further evidence about factors that might account for differences in achievement, issues are difficult to address. For example, the teacher questionnaire contains a large amount of content about teaching styles and classroom practices related to the teaching of literacy and numeracy. It explores teacher self-efficacy in teaching different components of these domains and collects data on training undertaken on specific aspects of teaching literacy and numeracy. Such data would be expected to be informative for understanding any classroom-level factors that may be impeding or fostering performance at the student level. Supporting governments to use these data adequately is an essential objective of PILNA.

A collaborative approach has ensured that the data collected in PILNA provide the necessary information to inform participating countries on key areas of policy. To further this idea of utilizing data collected in PILNA to meet the policy needs of countries, a PILNA data exploration workshop was conducted in October 2019 to help national representatives undertake secondary analysis of their national datasets.

Table 2 The six key purposes and use of results for PILNA

1. Interventions – Develop interventions to improve literacy and numeracy at system and school level:
• Identify and design interventions to improve learning and teaching
• Establish and implement intervention strategies
• Use assessment data to inform classroom interventions
• Review programs and support offered to schools
• Guide discussions between countries and development partners on priority interventions at country level
• Show gender disaggregation of results to inform interventions
• Inform professional development on literacy interventions
• Decide areas for targeted interventions

2. Policy – Inform curriculum review, pedagogy, teacher-training institutions and education providers:
• Develop policies
• Provide evidence-based information for policy-making and interventions
• Plan and conduct program evaluations
• Focus resources to improve learning outcomes
• Provide information for accountability

3. Political support – Drive political commitment to improve results:
• Develop awareness at ministry level to drive support
• Support long-term vision for PILNA and EQAP at donor level
• Inform donors about value for money for investment
• Provide information for government (Cabinet)
• Create a profile of learning outcomes based on PILNA results for countries
• Build more accountability for data and results at all levels of the education system
• Encourage cross-sectoral collaboration and partnership to achieve results

4. Community awareness – Present, share, and use results with school communities and education stakeholders:
• Develop community awareness to take ownership of results
• Provide information for parents and communities
• Provide information for schools
• Create a sense of ownership and responsibility for results

5. Monitoring results – Encourage country ownership of data through capacity building, collection, and interrogation of results:
• Provide a measure for tracking results
• Use PILNA results to set literacy and numeracy benchmarks at district, provincial, and national levels
• Observe any shifts in results since 2012
• Observe where participating countries sit in relation to the regional literacy and numeracy benchmarks
• Share results with countries of similar backgrounds
• Engage in cross-country comparison

6. National validation – Validate national results/data:
• Use PILNA results to support, validate, and improve national assessments
• Confirm literacy and numeracy outcomes against other national sources (e.g., NGO surveys/research, national census)

Source: Belisle et al. (2016, p. 9)


The workshop was tailored toward nontechnical audiences to focus on aspects of the data that are relevant to their national needs, and it enabled them to accurately report statistical outcomes based on their data using a tool that did not require prerequisite knowledge of statistical software.

For many countries in the region, PILNA provides the only source of data on students' literacy and numeracy outcomes and on how student, teacher, and school backgrounds are related to learning outcomes. For other countries, it provides a valuable additional source of data. The aforementioned data exploration workshop was fundamental in supporting countries to start using their data for policy development and policy-making purposes.
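To make the kind of secondary analysis introduced at such workshops more concrete, the following minimal Python sketch shows how design-weighted subgroup means might be computed from a national extract. It is an illustration only: the file name and the column names (score_numeracy, gender, region, student_weight) are hypothetical and do not reflect the actual PILNA data structure or the workshop tool, which did not require statistical software.

import pandas as pd

# Hypothetical national extract; the file and column names are illustrative
# only and are not the actual PILNA variables.
students = pd.read_csv("pilna_national_extract.csv")

def weighted_mean(group: pd.DataFrame) -> float:
    # Design-weighted mean numeracy score for one subgroup.
    weights = group["student_weight"]
    return (group["score_numeracy"] * weights).sum() / weights.sum()

# Weighted mean numeracy score by region and gender: the sort of
# subgroup breakdown a ministry might report after the workshop.
summary = (
    students.groupby(["region", "gender"])
    .apply(weighted_mean)
    .rename("weighted_mean_numeracy")
    .reset_index()
)

print(summary)

In practice, analyses of this kind would also need to respect the full sampling design (for example, by using replicate weights) when estimating standard errors for such subgroup means.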

Quality of the Assessment Program

The quality of PILNA data has become a facilitator in improving policy processes in Pacific countries. Since 2015, PILNA has addressed technical issues such as sampling and analysis methodology to improve the legitimacy of its results for its regional education stakeholders. Additionally, the contextual component and the data usage workshop enable governments to explore key questions and issues aimed at informing evidence-based policy development – for example, early learning opportunities and teacher training and certification. From a technical standpoint, PILNA has addressed issues related to noncomparability of data over cycles, so that results are easier for policymakers to use in the monitoring and evaluation of policy initiatives.

Over the first three cycles of administration (2012, 2015, and 2018), PILNA has played an integral part in informing the education policy agendas of Pacific countries. In considering how large-scale assessments inform policy, Sutcliffe and Court (2005) define four stages of the policy cycle: agenda setting, policy formulation, policy implementation and monitoring, and policy evaluation. As PILNA develops, it addresses all four stages of the policy cycle. Data from PILNA have raised awareness among the Pacific Islands' senior education decision-makers and led to their ongoing endorsement of PILNA's purpose. In terms of policy formulation, the recent work with regional education stakeholders on learning how to explore their data has helped those stakeholders ask questions from which to develop and implement policy.

Each cycle of PILNA includes a full field trial of all cognitive and contextual instruments and processes, including sampling, field operations, test administration, coding, and data analysis. The field trial provides an opportunity for EQAP and participating governments to test the range of contexts in which PILNA is administered and is typically carried out in all countries that will participate in the main study. The field trial results enable the Steering Committee, along with each country's curriculum specialists and national statistics officers, to make decisions about the main study. The processes undertaken in PILNA, particularly with respect to item development, field trial, and administration, have led participating countries to reexamine their national assessment tools and practices and, in some cases, take on board some of the quality assurance steps that are integrated into the PILNA program.


The consensual process of decision-making and policy-making in the Pacific is a point of difference from other ILSAs. In other words, all key stakeholders for PILNA are involved at each stage of each cycle, and use this involvement to inform their country policy-making processes.

What Does PILNA's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?

PILNA has a range of dissemination strategies that are critical to widespread ownership and acceptance of its results. Three series of reports are published and disseminated for each PILNA cycle, namely, a regional report, 15 country reports, and a small island states report. The PILNA Regional Report is formally launched during its reporting year and is made available publicly on the EQAP website. Country-specific reports are distributed to each participating country, and the sharing of country-level data is at the behest of each government. A Small Island States Report is additionally developed for a political grouping of five small island states. Through the PILNA Steering Committee, countries have endorsed a data-sharing commitment that outlines control over data sharing, particularly empowering each government with the decision to share its country-specific results.

The PILNA Steering Committee has worked collaboratively to develop a communications plan to ensure countries are engaged in the reporting of their results. For example, each country reviews and comments on a first draft of its PILNA report and has final sign-off before the report is published. The reports include data and analyses aimed at system-level stakeholders, but they also include "coding stories" – specific examples from the PILNA data highlighting information about the frequency of different or incorrect responses by students, particularly in the identification of common misconceptions. These data provide insight into different levels of understanding or ability in relation to the concepts and skills assessed in literacy and numeracy. The coding stories also provide practical information for teachers in terms of ways to support student learning that may help overcome a specific misconception. This use of data makes the report of interest to teachers and school leaders, and potentially to teacher education institutions.

EQAP provides a Ministerial Brief with key findings and recommendations emerging from PILNA. The Ministerial Briefs are country-specific. The communications plan acknowledges that data need to be available not only for system-level improvements of education quality but also accessible to teachers and schools. EQAP supports countries in thinking through communication strategies in ways that are meaningful for country stakeholders.

PILNA has been widely covered in Pacific media, and its results have been a factor in helping the public understand the importance of education assessment. Dissemination missions are structured to provide fit-for-purpose workshops to share and discuss results with a variety of stakeholders.


A high-level workshop is held with senior ministry officials, and then more in-depth discussions are carried out with ministry officials responsible for curriculum and assessment. Ministries are encouraged to include national posts of overseas development organizations in the workshops to enhance the use of PILNA data, findings, and recommendations in a coordinated way.

In a number of countries, teachers and head teachers participate in data dissemination workshops, which have the potential to contribute to insights on what students know and can do. These workshops are conducted in ways that provide opportunities to explore the national results. The workshops also provide a forum for participants to share ideas, for example, about using results to inform classroom instruction. Some participating countries also facilitate community and/or parent workshops, with EQAP available to support their ministry counterparts. These dissemination activities enable stakeholders to learn about and become engaged with PILNA data. Importantly, all activities related to PILNA remain aligned with the six key purposes identified by the PILNA Steering Committee in 2015, and these shape a consensual process of policy-making in the Pacific Islands region.

Southeast Asia Primary Learning Metrics (SEA-PLM)

A Brief Overview of SEA-PLM

The Southeast Asia Primary Learning Metrics (SEA-PLM) program was initiated in 2012 by the Southeast Asian Ministers of Education Organization (SEAMEO) and the United Nations Children's Fund (UNICEF). The program was designed to help mitigate educational challenges faced in a region which, while having many national and other assessments, was without an assessment for students at the end of primary schooling. SEA-PLM stood apart from other international assessments as it was developed with a focus on regional issues and regional ownership and built on existing regional capacity. Countries participating in the SEA-PLM program anticipated that such an assessment would strengthen their capability to report on the Sustainable Development Goal (SDG) 4 targets (UNICEF and SEAMEO, 2019). They also expected the program to work toward creating a better understanding of the status of student learning achievement and thereby improving the quality of their education systems (UNICEF and SEAMEO, 2019).

SEA-PLM assesses student academic outcomes in four key subject areas toward the end of primary school (i.e., Grade 5), namely, mathematics, reading, writing, and global citizenship. The first round of SEA-PLM was conducted in 2019 and included Cambodia, Lao PDR, Malaysia, Myanmar, the Republic of the Philippines, and Vietnam. The conceptual framework for SEA-PLM has been customized to the ASEAN cultural background and tailored to reflect the curriculum of all ASEAN/SEAMEO member countries in the region (UNICEF and SEAMEO, 2017). Contextual questionnaires were designed to collect data reflecting issues and priorities of particular relevance to students, parents, schools, communities, and policymakers in the Southeast Asian region.


What Is the Role of Capacity Building Activities and Technical Quality in SEA-PLM?

Two of the three aims that are central to SEA-PLM refer to capacity building. Thus, in addition to improving regional integration for monitoring student outcomes in four subject areas, the program aims to achieve (UNICEF and SEAMEO, 2019, p. 6):

• Capacity enhancement for gathering and analyzing assessment data at regional, national, and subnational levels
• Capacity building for utilizing assessment data and improving educational outcomes at regional, national, and subnational levels

The central focus on capacity support to participating SEA-PLM countries is reflected in the implementation of four large-scale regional workshops, which coincided with 37 in-country capacity support sessions (UNICEF and SEAMEO, 2019) with support from SEA-PLM's technical partner, the Australian Council for Educational Research (ACER). Each of these sessions was designed to respond to the specific needs of each country for the development, piloting, and implementation of the SEA-PLM initiative.

Capacity support was not limited to countries participating in SEA-PLM. Instead, regional workshops were open to all ASEAN/SEAMEO countries, providing both SEA-PLM and non-SEA-PLM participating countries with opportunities to connect with one another and build peer knowledge about the nature of assessment and about quality assurance for operationalizing ILSAs, including sampling, translation, adaptation, test administration, data submission, and item coding for common metrics development across the region (UNICEF & SEAMEO, 2019).

How Is SEA-PLM Integrated into the Policy-Making Process of Participating Countries?

SEA-PLM is embedded in national as well as regional systems and structures. The program is guided by a Regional Steering Committee comprising all 11 SEAMEO member countries (not only the SEA-PLM participants). The Regional Steering Committee is supported by the SEA-PLM Secretariat, represented by UNICEF and SEAMEO. SEA-PLM has been endorsed at the SEAMEO High Officials Meetings (HOM) and included in the Association of Southeast Asian Nations (ASEAN) 2016–2020 work plans. SEA-PLM has been acknowledged by all ASEAN countries to be of strategic importance to the SEAMEO political framework – thus confirming regional ownership of the program.

The political structures of SEA-PLM ensure that participating ministries of education have a strong sense of ownership, which is strengthened by involving National Technical Teams (NTTs) that are directly involved in building the capacity of their national systems (UNICEF and SEAMEO, 2019).


Fig. 2 SEA-PLM governance structure. (Source: SEAMEO and UNICEF, 2019)

Moreover, policy integration is enhanced by ensuring that the NTTs coordinating SEA-PLM are the same teams who manage national and other international large-scale assessments in the countries. The buy-in from participating countries is demonstrated by ministries committing their own national budgets to building systems to monitor learning, which is aimed at ensuring the sustainability and continuity of the SEA-PLM initiative. The governing structure for the program, depicted in Fig. 2, exhibits the level of regional integration and participation and how these are embedded in national systems.

As SEA-PLM progresses, it is expected to address all four stages of the policy cycle (Sutcliffe & Court, 2005). Still, at this early stage, with the first cycle only just concluded, ministries from SEAMEO countries agree that, so far, the evidence from the program mainly addresses the agenda setting aspect of the policy-making cycle, particularly by reporting against SDG 4.1 indicators (United Nations, 2016). However, efforts in terms of reporting, further analyses, and dissemination are intended to support the use of information from SEA-PLM for EBP.

What Does SEA-PLM's Strategy Look like in Terms of Access and Results Dissemination to Stakeholders, Including the Media?

Reports of the results from the first round of SEA-PLM, released in December 2020 (UNICEF-SEAMEO, 2020), have played an important role in supporting countries to understand their results and begin the development of appropriate policies to support student learning. The SEA-PLM strategy has been to focus first on insights that arise from a detailed understanding and measurement of student outcomes and second on insights arising from an exploration of factors related to student outcomes.


Thus, first, SEA-PLM reports provide details of a common understanding and measurement of educational outcomes in terms of what students know and can do by developing a set of Described Proficiency Scales (DPSs) for each of the three cognitive domains. A DPS is an ordered set of descriptions of the skills and knowledge gained by learners as they make progress in a domain of learning, underpinned by an empirical scale of proficiency. While the assessment framework defines the concepts of literacy in reading, writing, and mathematics, a DPS contains information about how these literacies develop. DPSs also allow the performance of students on the assessment to be understood qualitatively, not just as numerical scores. This information can then be used to inform teachers about how they can effectively assist students to move from one level of proficiency to the next.

The second strategy is to focus on insights from SEA-PLM regarding other factors associated with student outcomes. While further analyses are required to explore how the information obtained in student, teacher, and principal questionnaires relates to outcomes, some information is already available regarding factors that go beyond the school environment. For example, many countries in the region are made up of populations with hundreds of national languages, where the use of the mother tongue or the adoption of a common language of instruction continues to be fiercely debated. The results of SEA-PLM suggest that countries with long-standing, consistent, and effectively implemented national language policies have seen better outcomes in student reading, writing, and mathematics.

Financing of education is also an important factor. Countries that see strong outcomes in reading also report government expenditure on education in excess of 4% of GDP, while some countries that have not performed as well allocate less than 2%. It is not just about how much money is spent on education, but rather how it is spent. Improving reading outcomes requires resources to be allocated to well-designed and effectively targeted intervention strategies that start early. In some relatively higher performing countries, financial allocations to pre-primary education are equal to those of the primary education budget.

While these are first insights, the extent to which these data will be used in further analyses for EBP depends on each participating country's level of understanding of the data. Therefore, in terms of its dissemination strategy, the focus of the next stage of SEA-PLM is on creating awareness about data interpretation and usage so that a lack of awareness does not become a barrier to the application of SEA-PLM results in education policy-making.

Going forward, another key aspect of SEA-PLM's results and dissemination strategy is aimed at emphasizing to SEA-PLM countries the importance of incorporating student assessment data in their capacity development plans and ensuring that SEA-PLM results are well understood and applied to improving learning. This extends to matters relating to curriculum, teacher education, school and system financing, and tracking progress against the SDGs.
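To illustrate the idea behind a Described Proficiency Scale described above in a more concrete form, the short sketch below maps a numerical scale score to a qualitative band description. The cut-points and descriptors are invented for illustration and are not the actual SEA-PLM bands or wording.

from bisect import bisect_right

# Invented cut-points and descriptors for a hypothetical reading scale;
# the real SEA-PLM Described Proficiency Scales define their own bands.
CUTPOINTS = [300, 350, 400]  # lower bounds of bands 2-4 on the hypothetical scale
DESCRIPTORS = [
    "Band 1: locates explicitly stated information in very short texts",
    "Band 2: links pieces of information within a short, familiar text",
    "Band 3: makes straightforward inferences across a familiar text",
    "Band 4: interprets and reflects on longer or less familiar texts",
]

def describe(scale_score: float) -> str:
    # Return the qualitative band description for a numerical scale score.
    return DESCRIPTORS[bisect_right(CUTPOINTS, scale_score)]

print(describe(362))  # prints the Band 3 description

Reporting in this form is what allows a score to be read as a statement of what a student can typically do, rather than as a number alone.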


In summary, SEA-PLM can be said to belong to the people of the region and has been tailored specifically for them. There is therefore confidence that the results from such a customized ILSA will provide policy directions and help mitigate educational issues that are unique to the Southeast Asian region.

Conclusion

In conclusion, the role of ILSAs in economically developing countries has been well illustrated in this chapter through PILNA and SEA-PLM as specific instances of ILSAs. It has been shown how these programs have used assessment characteristics, design, and implementation, as well as knowledge of the factors influencing the uptake of evidence in policy-making, to increase the likelihood of their being used in EBP.

In terms of assessment characteristics, both PILNA and SEA-PLM employ high-quality probability samples to enable the accurate estimation of student outcomes in participating countries. SEA-PLM measures student outcomes in terms of performance toward the end of primary schooling, namely, at the end of Grade 5, which, over time, will enable the monitoring of the performance of successive Grade 5 cohorts in participating countries. PILNA assesses student performance at two grades, namely, after 4 and 6 years of school, respectively. Although this means that student performance can be monitored at two levels, no change in performance for a particular student cohort over time (e.g., Grade 4 in 2012 to Grade 6 in 2014) can be traced, since PILNA is undertaken every 3 years. In terms of subject areas, while both regional ILSAs assess numeracy and literacy, SEA-PLM, in addition, assesses students' writing skills and their knowledge and attitudes in the area of global citizenship.

Both PILNA and SEA-PLM have sought to increase the relevance of the assessments for EBP through their design and implementation. As called for by Cresswell et al. (2015), they have provided a far greater opportunity than other ILSAs to ask questions of students, teachers, and principals that are relevant to local policymakers and in a way that reflects the geographical and cultural context of participating education systems. Combined with better targeted measures of student outcomes in the assessed areas (Lockheed & Wagemaker, 2013), this has increased the potential to explore factors associated with different levels of student outcomes. They have done this with a view toward the ultimate purpose of comparative large-scale assessments, which is to move beyond rankings and provide robust evidence to policymakers of what students, teachers, school principals, and communities can do to provide students with the best possible learning today and into the future (Lockheed, 2012).

As illustrated in this chapter, both regional ILSAs have gone to some length to integrate the whole assessment program into the policy-making process. Both have high-level ministerial buy-in and organizing principles for participating education systems. In PILNA, integration into the policy-making process of the Pacific region is evident in several ways. Not only is there a consensual approach to decision-making by the PILNA Steering Committee, but the CEOs of each country have also committed to the six key purposes that guide the ongoing development and use of evidence from PILNA.


In addition, PILNA has addressed issues related to the noncomparability of data over cycles, so that results are easier for policymakers to use in the monitoring and evaluation of policy initiatives over time. Similarly, SEA-PLM was endorsed at the highest policy-making level by the SEAMEO High Officials Meetings (HOM) and incorporated into the Association of Southeast Asian Nations (ASEAN) 2016–2020 work plan. At the implementation level, integration is achieved by the national teams conducting SEA-PLM being the same teams managing other assessments in the participating countries' ministries of education.

As also illustrated, both regional ILSAs have been used at the agenda setting stage of the policy-making cycle. Still, as PILNA has been implemented three times, namely, in 2012, 2015, and 2018, it has seen more instances of evidence from the assessment being used at the monitoring and policy evaluation stage of the policy-making cycle than SEA-PLM, which has only been implemented once so far. While one of the purposes of SEA-PLM is certainly the use of data for the monitoring of SDGs in participating countries, its use for the monitoring of specific regional or national policy initiatives is yet to be illustrated.

What can also be seen clearly is that PILNA and SEA-PLM are at different stages of participating countries' consideration of the results. In terms of instrument design and analyses, PILNA, in its third administration, is paying greater attention to exploring and analyzing factors that can explain differences in performance and to learning from other, similar systems (Addey et al., 2017; Fischman et al., 2017). For a country participating in SEA-PLM, having just finished its first implementation, the main challenge is to overcome the preoccupation with rankings and with what the assessment says about a system's overall performance relative to that of other countries.

In order to facilitate evidence-based policy-making, both regional assessment programs have spent considerable effort and resources on capacity building activities. Various workshops have been conducted on all aspects of an assessment, from developing policy questions, through the design of frameworks and items for context questionnaires and cognitive tests, to building capacity in the examination of the assessment databases by ministerial staff to explore key questions and issues for informed policy development. While such capacity building tended to focus on specific assessment skills, it facilitated not only cross-country discussions of education priorities and agendas but also within-country discussions among staff from different ministries or different departments within ministries.

To turn possible barriers into facilitators, both assessment programs have spent time and resources on developing guidelines for the reporting and dissemination of results. More specifically, PILNA has developed a dissemination strategy that includes comprehensive coverage in Pacific media and focuses on helping the public understand the importance of education assessments.

Thus, both PILNA and SEA-PLM have built on the learnings from other large-scale assessments by focusing on factors that have been shown to foster the use of evidence from such assessments in the various stages of the policy cycle. However, challenges remain in terms of keeping the political and public momentum going to ensure the sustainability of these assessments.


As both PILNA and SEA-PLM develop further, it will be informative to undertake systematic reviews of the actual use of evidence from these programs in education policy, to examine the extent to which the time, financial, and human resources these programs require are commensurate with the benefits they render in terms of improved learning for all students in the participating education systems.

References

Addey, C., Sellar, S., Steiner-Khamsi, G., Lingard, B., & Verger, A. (2017). The rise of international large-scale assessments and rationales for participation. Compare: A Journal of Comparative and International Education, 47(3), 434–452.
Banks, G. (2010). Evidence-based policy making: What is it? How do we get it? In World scientific reference on Asia-Pacific trade policies: 2: Agricultural and manufacturing protection in Australia (pp. 719–736).
Belisle, M., Cassity, E., Kacilala, R., Seniloli, M. T., & Taoi, T. (2016). Pacific Islands Literacy and Numeracy Assessment: Collaboration and innovation in reporting and dissemination (Using assessment data in education policy and practice: Examples from the Asia Pacific; Issue 1). ACER/UNESCO. Retrieved from https://research.acer.edu.au/cgi/viewcontent.cgi?article=1023&context=ar_misc
Best, M., Knight, P., Lietz, P., Lockwood, C., Nugroho, D., & Tobin, M. (2013). The impact of national and international assessment programmes on education policy, particularly policies regarding resource allocation and teaching and learning practices in developing countries. Retrieved from Australian Council for Educational Research website: https://research.acer.edu.au/ar_misc/16
Braun, H. I., & Singer, J. D. (2019). Assessment for monitoring of education systems: International comparisons. The ANNALS of the American Academy of Political and Social Science, 683(1), 75–92.
Breakespear, S. (2012). The policy impact of PISA: An exploration of the normative effects of international benchmarking in school system performance. OECD.
Buckley, H., Tonmyr, L., Lewig, K., & Jack, S. (2014). Factors influencing the uptake of research evidence in child welfare: A synthesis of findings from Australia, Canada and Ireland. Child Abuse Review, 23(1), 5–16.
Cherney, A., Povey, J., Head, B., Boreham, P., & Ferguson, M. (2012). What influences the utilisation of educational research by policy-makers and practitioners? The perspectives of academic educational researchers. International Journal of Educational Research, 56, 23–34.
Cox, C., & Meckes, L. (2016). International large-scale assessment studies and educational policymaking in Chile: Contexts and dimensions of influence. Research Papers in Education, 31(5), 502–515.
Cresswell, J., Schwantner, U., & Waters, C. (2015). A review of international large-scale assessments in education: Assessing component skills and collecting contextual data. PISA, The World Bank, Washington, D.C. OECD Publishing.
Department for International Development (DFID). (2013). Education position paper: Improving learning, expanding opportunities. Retrieved from https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/225715/Education_Position_Paper_July_2013.pdf
Department of Foreign Affairs and Trade (DFAT). (2015). Strategy for Australia's aid investments in education 2015–2020, September 2015. Retrieved from https://dfat.gov.au/about-us/publications/Pages/strategy-for-australias-aid-investments-in-education-2015-2020.aspx
Fischman, G. E., Topper, A. M., Silova, I., Holloway, J. L., & Goebel, J. (2017). An examination of the influence of international large scale assessments and global learning metrics on national school reform policies.


Gilmore, A. (2005). The impact of PIRLS (2001) and TIMSS (2003) in low- and middle-income countries. International Association for the Evaluation of Educational Achievement (IEA).
Heyneman, S. P., & Lee, B. (2014). The impact of international studies of academic achievement on policy and research. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues and methods of data analysis (pp. 37–72). Taylor and Francis Group.
Jones, H. (2012). Promoting evidence-based decision-making in development agencies. ODI Background Note, 1(1), 1–6.
Kamens, D. H. (2013). Globalization and the emergence of an audit culture: PISA and the search for 'best practices' and magic bullets. In PISA, power and policy: The emergence of global educational governance (pp. 117–140). https://books.google.com.au/books?hl=en&lr=&id=fqFwCQAAQBAJ&oi=fnd&pg=PA117&ots=Q-hHrXY0-b&sig=w2b5OLh3egcR_uDEBjjRJllbHcg#v=onepage&q&f=false
Kim, H. K., Lee, D. H., & Kim, S. (2016). Trends of science ability in the National Assessment of Educational Achievement (NAEA) of Korean ninth graders. EURASIA Journal of Mathematics, Science and Technology Education, 12(7), 1781–1798.
Kvernbekk, T. (2015). Evidence-based practice in education: Functions of evidence and causal presuppositions. Routledge.
Lietz, P., Tobin, M., & Nugroho, D. (2016). Understanding PISA and its impact on policy initiative: A review of the evidence. In L. M. Thien, N. A. Razak, J. P. Keeves, & I. G. N. Darmawan (Eds.), What can PISA 2012 data tell us? Performance and challenges in five participating Southeast Asian countries. Sense Publishers.
Lockheed, M. (2012). Policies, performance and panaceas: The role of international large-scale assessments in developing countries. Compare: A Journal of Comparative and International Education, 42(3), 509–545.
Lockheed, M. E., & Wagemaker, H. (2013). International large-scale assessments: Thermometers, whips or useful policy tools? Research in Comparative and International Education, 8(3), 296–306.
Mullis, I. V. S., & Martin, M. O. (Eds.). (2017). TIMSS 2019 assessment frameworks. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2019/frameworks/
OECD. (2019a). PISA 2018 assessment and analytical framework. PISA, OECD Publishing.
OECD. (2019b). Evaluation and assessment: Policy priorities and trends, 2008–19. In Education policy outlook 2019: Working together to help students achieve their potential. OECD Publishing. https://doi.org/10.1787/25806c6b-en
Oliver, K., Innvar, S., Lorenc, T., Woodman, J., & Thomas, J. (2014). A systematic review of barriers to and facilitators of the use of evidence by policymakers. BMC Health Services Research, 14(2), 1–12. https://doi.org/10.1186/1472-6963-14-2
Orem, J. N., Mafigiri, D. K., Marchal, B., Ssengooba, F., Macq, J., & Criel, B. (2012). Research, evidence and policymaking: The perspectives of policy actors on improving uptake of evidence in health policy development and implementation in Uganda. BMC Public Health, 12(109), 1–16. https://doi.org/10.1186/1471-2458-12-109
Pacific Community (SPC). (2019). Pacific Islands Literacy and Numeracy Assessment (PILNA) 2018 regional report. SPC.
Sanderson, I. (2003). Is it 'what works' that matters? Evaluation and evidence-based policymaking. Research Papers in Education, 18(4), 331–345. https://doi.org/10.1080/0267152032000176846
Schulz, W., Ainley, J., Fraillon, J., Losito, B., & Agrusti, G. (2016). IEA international civic and citizenship education study 2016: Assessment framework. International Association for the Evaluation of Educational Achievement (IEA).
SEAMEO & UNICEF. (2019). Governance structure. Retrieved from https://www.seaplm.org/index.php?option=com_content&view=article&id=23&Itemid=224


Shamatov, D., & Sainazarov, K. (2010). The impact of standardized testing on education in Kyrgyzstan: The case of the Programme for International Student Assessment (PISA) 2006. International Perspectives on Education and Society, 13, 145–179.
Singer, J. D., & Braun, H. I. (2018). Testing international education assessments. Science, 360(6384), 38–40.
Sutcliffe, S., & Court, J. (2005). Evidence-based policymaking: What is it? How does it work? What relevance for developing countries? https://apo.org.au/node/311114
Tobin, M., Lietz, P., Nugroho, D., Vivekanandan, R., & Nyamkhuu, T. (2015). Using large-scale assessments of students' learning to inform education policy: Insights from the Asia-Pacific region. Australian Council for Educational Research (ACER). http://research.acer.edu.au/monitoring_learning/21
UNICEF & SEAMEO. (2017). SEA-PLM 2019 assessment framework (1st ed.). United Nations Children's Fund (UNICEF) & Southeast Asian Ministers of Education Organization (SEAMEO) – SEA-PLM Secretariat.
UNICEF & SEAMEO. (2018). SEA-PLM 2019 trial testing report. United Nations Children's Fund (UNICEF) & Southeast Asian Ministers of Education Organization (SEAMEO) – SEA-PLM Secretariat.
UNICEF & SEAMEO. (2019). The Southeast Asia primary learning metrics program: Thinking globally in a regional context. Australian Council for Educational Research.
UNICEF & SEAMEO. (2020). SEA-PLM 2019 main regional report: Children's learning in 6 Southeast Asian countries. United Nations Children's Fund (UNICEF) & Southeast Asian Ministers of Education Organization (SEAMEO) – SEA-PLM Secretariat.
United Nations. (2016). Sustainable development goals: Quality education. Retrieved from https://www.un.org/sustainabledevelopment/education/
United Nations Educational, Scientific and Cultural Organization (UNESCO). (2015). Incheon declaration — education 2030: Towards inclusive and equitable quality education and lifelong learning for all. UNESCO.
United Nations Educational, Scientific and Cultural Organization (UNESCO). (2018). UNESCO Pacific strategy 2018–2022. UNESCO.
United States Agency for International Development (USAID). (2019). Learning from experience. Retrieved from https://www.usaid.gov/project-starter/program-cycle/cdcs/learning-fromexperience
Wagner, D. A., Lockheed, M., Mullis, I., Martin, M. O., Kanjee, A., Gove, A., & Dowd, A. J. (2012). The debate on learning assessments in developing countries. Compare: A Journal of Comparative and International Education, 42(3), 509–545.
Willms, J. D. (2018). Learning divides: Using monitoring data to inform education policy. UNESCO Institute for Statistics.
Young, E., & Quinn, L. (2002). Writing effective public policy papers. Open Society Institute.

Part III Meta-perspectives on ILSAs: The Role of Theory in ILSAs

8 Comprehensive Frameworks of School Learning in ILSAs

Agnes Stancel-Piątak and Knut Schwippert

Contents
Introduction
Models of School Learning and their Developments
The Shift from Input-Output to Input-Process-Output Paradigm
Extensions from the Socio-Ecological Perspective
Extensions Reflecting the Dynamic Perspectives
Models of School Learning in ILSA Frameworks
The Beginning: Pilot Twelve-Country Study, Early ILSAs (FIMS, SIMS, IRLS), and IEA's Foundational Curriculum Model
Subsequent ILSAs: IEA's TIMSS and PIRLS and OECD's PISA and TALIS
Early Childhood Education and Care (ECEC)
Conclusions and Recommendations
References

Abstract

The major aim of International Large-Scale Assessments (ILSAs) of school systems is to collect comprehensive information about factors related to learning, teaching, and student outcomes from various education systems in order to identify possible factors and areas for improvement at different levels and from different perspectives. Although ILSAs differ in their specific research and policy focus, the overall study designs rely on a common understanding of the structure and processes within education systems. Learning in schools is perceived as being embedded into the context of (a) the local school with its different actors, facilities, and regulations, and (b) the national or regional context. More recently, due to continuous globalization, school systems are increasingly perceived, analyzed, and described from a global perspective. The development of the respective theoretical foundations describing students' learning within the school system can be traced back to the 1960s, when the idea of opportunity to learn developed. Recent extensions are based upon input-process-output models and draw from economic and system theories. They provide a comprehensive and elaborated framework describing learning as a dynamic process within the school system. The main aim of this chapter is to first review the most prominent frameworks of school learning. Next, ILSAs are reviewed with respect to the extent to which they draw implicitly or explicitly on these frameworks. Finally, possibilities for further developments are discussed. The structure of the chapter follows these aims.

Keywords

Development of ILSA · Comprehensive Theoretical Frameworks · Effectiveness of Education Systems

Introduction

In the last few decades, models of student learning have been systematically developed and extended based on empirical research and theoretical advances. Among others, system theories have substantially contributed to providing an analytical frame for a comprehensive perspective on learning in educational settings. Current International Large-Scale Assessments (ILSAs) are often anchored in the tradition of their preceding studies, which, to some extent, rely on the systemic perspective, even if it is not always made explicit in the framework.

One of the features found in ILSAs that can be attributed to the influence of system theory is the perception, typical of the input-process-output paradigm, according to which the learning process is described as a "production process" and its outcome as a "product" (Anderson, 1961). The input-process-output paradigm is part of systems theory, which itself evolved within the biological sciences but was further developed into a general systems theory applicable to all systems: biological, physical, and sociological (Kast & Rosenzweig, 1972). While the term "production process" was commonly used in early ILSAs, it has been replaced by "learning process" in the recent literature. Another common feature ILSAs owe to system theory pertains to the assessment, description, and analysis of student learning as being embedded into the larger context of national education systems and considering specific societal and cultural conditions. Beyond this, more recent discussions of school learning refer more frequently to the global perspective, reflecting the interest of global political and financial organizations in pursuing international comparisons (for a critical review see ▶ Chap. 4, "Reasons for Participation in International Large-Scale Assessments").


Aside from the overarching systemic views on school learning, theoretical frameworks of ILSAs draw from empirical educational research as well as from theories of pedagogical content knowledge, psychological research, and psychometrics. Nonetheless, the extent to which the theoretical foundations of ILSA frameworks are elaborated, and the extent to which theory is given preeminence, varies between studies. It depends (a) on whether the study has a stronger policy or research orientation (which oftentimes is closely related to the founding organization); (b) on the expertise and interests of the study conductors; (c) on the priorities of participating education systems; and, last but not least, (d) on the financial resources available. The selection of topics and content areas is guided by the goals and focus of the study.

The studies also differ from each other in their approaches to student assessment with respect to the extent to which they aim at achieving curricular sensitivity. The most preeminent examples are PISA (Programme for International Student Assessment) on the one hand and TIMSS (Trends in International Mathematics and Science Study) and PIRLS (Progress in International Reading Literacy Study) on the other. While PISA explicitly distances itself from the idea of curricular sensitivity by focusing on skills defined as key competences for participation in society (including global competence in the 2018 cycle; OECD, 2019), TIMSS and PIRLS strive toward a charming but ambitious goal: to assess educational opportunities defined by a "broad agreement" about the underlying intended curriculum, assumed to embrace the diverse curricula of participating education systems around the world (Mullis & Martin, 2013, p. 13).

Aside from the vivid research interest that ILSAs have increasingly attracted over the last decades, there is also increased policy interest in research findings from ILSAs in education. On the one hand, this has a positive effect on the number of participating countries and the funds available; on the other hand, it raises expectations about the outcomes of the studies and their potential contribution to guiding policy. This not only implies a need for increased efforts by researchers and study conductors to ensure quality, but also a need for accountability, justification, and documentation of decision-making processes. Study frameworks are one of the products of ILSAs that are expected to provide justification for the content and scientific soundness of the whole study. From a scientific point of view, the conceptual framework is the very core of an empirical study and should provide the theoretical background for the research questions as well as an overview of recent research findings, including explanations of how these have been implemented in the respective empirical application.

This chapter reviews ILSAs' overarching frameworks, which are based on different traditions of the social sciences, by reflecting on their specific theoretical foundations. The aim is to provide an overview of study frameworks based on selected criteria derived from relevant theories on student learning in school. It should be noted that the focus is on the overall theoretical frameworks, which serve to describe the teaching and learning processes within education systems, while the detailed assessment frameworks of each study are not included. With regard to the latter, it should be noted that despite the diversity of approaches and foci, these frameworks are well elaborated and of high quality in the majority of ILSAs.


The chapter starts with an overview of the most preeminent models of school learning, including their extensions. Model developments from input-output models into increasingly comprehensive and complex models are described from a historical perspective, in the light of theoretical developments within the fields of school effectiveness and educational effectiveness research. Further in this chapter, the frameworks of the major current ILSAs and their earlier forerunners are reviewed with respect to the theoretical foundations of their overall conceptual frameworks. The review relies on criteria derived from the preceding model overview. Due to the pioneering work of the International Association for the Evaluation of Educational Achievement (IEA) in the development of ILSAs, this part of the chapter starts with a review of early IEA studies and continues with the ILSAs implemented by the two most well-known study conductors, IEA and the Organisation for Economic Co-operation and Development (OECD). Space limitations do not allow for exploration beyond the major ILSAs, although some minor studies are briefly mentioned and sources are provided for further insight. Finally, challenges and potentials of recent ILSA procedures with respect to theoretical study framing are critically discussed, and conclusions for future study development are provided.

Models of School Learning and their Developments

Models of school learning have been developed from the early 1960s on, when the debate around school learning and student outcomes became a vivid topic for researchers and policy makers, at first in the United States (USA) with the groundbreaking study by Coleman et al. (1966), which led to the controversial conclusion that schooling has little impact on student achievement. Early models were influenced primarily by psychological theories on learning and teaching, while the economic and systemic perspectives came gradually into play, responding to the need for a framework that would suit a broader description of the school system with its multilevel and multifaceted structure. The systemic perspective became an integral part of organizational theories in the 1960s (Katz & Kahn, 1966). Organizational theories contributed to extending learning models to include factors of the environment, based on the idea that learning takes place within the larger social-political-philosophical system (Scheerens & Bosker, 1997).

Over the years, research on school learning gradually gained popularity due to societal changes and the related, growing emphasis that countries' policy agendas placed on education and education policy. Large-scale assessments have played a tremendous role in these developments in many respects, starting with the very early large-scale investigations at the national level (e.g., Coleman et al., 1966; Jencks, 1973) and later expanding to the international level. ILSAs have contributed, sometimes in a controversial manner, to theoretical developments, deeper understandings of education systems and their cultural specifics, policy discourse and reforms, as well as to developments of research methods.


In recent publications, the research around school learning is labeled with varied and differently nuanced terms, such as school effectiveness research, school learning, and research on teaching and learning, to mention just a few. In this chapter, we do not aim at clarifying all terms used in the literature, or at providing a fully comprehensive overview of all research in this area. Instead, the chapter focuses on the theories that have guided the development of the internationally influential ILSAs. It provides an opportunity to link together relevant developments in research areas related to ILSAs, which often emerged independently. Observing the developments in school learning theories, empirical research, and education policy, it becomes obvious that none of these diverse views on education was sufficiently comprehensive to embrace student learning at school in a holistic manner. Research strands that explored learning and teaching have merged over the years with those research strands focused on the school system as a whole. The resulting theoretical perspectives have become as comprehensive and complex, and as nuanced and differential, as they currently are (see, as an example of an empirical application, the emerging idea of differential effectiveness in Scherer & Nilsen, 2018).

The Shift from Input-Output to Input-Process-Output Paradigm

One of the most preeminent models focusing on teaching-learning processes in the classroom was Carroll's (1963) model of school learning. It was originally developed within the input-process approach to describe foreign language learning and training. The model assumed that learning is a function of the following major factors: aptitude, perseverance, opportunity to learn (OTL), ability to understand instruction, and quality of instruction. The model was based on behavioral theories of structural learning and cognitive theories (Bandura, 1986). Aside from classroom variables, the model considers individual traits of students.

OTL became an integral part of the theoretical framing in early ILSAs almost from the beginning. The very first applications are described in the final report of the First International Mathematics Study (FIMS; Husén, 1967a, p. 65), for which the data collection was conducted in 1964. In the Six-Subject Study (Bloom, 1969), Carroll's model substantially contributed to the theoretical framing, enabling a systematic inclusion of measures for all variables relevant to the learning of French (Carroll, 1975, pp. 42–43). The concept was further developed and extended in the following IEA studies, in particular the Second International Mathematics Study (SIMS; Travers & Westbury, 1989) and the Second International Science Study (SISS, 1983–84; Anderson, 1989; Jacobson & Doran, 1988; IEA, 1988).

Subsequent models were extended to include students' ability to understand instruction, or general intelligence, a factor that was found to be an important predictor of learning outcomes in studies based on Carroll's model (Creemers et al., 2000). The idea was followed up by other authors in the concept of metacognition and the ability to learn how to learn, for instance, in Bloom's model of mastery learning (1986), or in the concept of direct instruction from Rosenshine and Stevens (1986).
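Carroll's original model, on which these later extensions build, is frequently summarized as a simple ratio formulation; the following paraphrase (a sketch of the commonly cited form, not a quotation from the sources above) makes the role of the five factors explicit:

\[
\text{degree of learning} = f\!\left(\frac{\text{time actually spent on learning}}{\text{time needed to learn}}\right),
\]

where the time actually spent is governed by the opportunity to learn and the learner's perseverance, and the time needed is governed by aptitude, the quality of instruction, and the ability to understand instruction.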


Moreover, subsequent models contributed substantially to the development of the concept of OTL, extending Carroll's initial idea by adding levels and quality dimensions to the previously solely quantitatively described concept. Based on the idea that students' opportunities to learn result from the intended (structurally predefined) and the implemented curriculum, a high correlation between the intended, implemented, and achieved curriculum (students' learning outcomes) is considered a quality criterion of educational effectiveness (McDonell, 1995).

Related empirical investigations of school learning have been heavily criticized. Their inconsistent results were blamed on the focus on single factors, a narrow view of student outcomes, and inadequate methodology (Reynolds & Teddlie, 2000, pp. 157–158). This criticism initiated theoretical developments that aimed at a comprehensive view of learning and teaching processes as well as more focused definitions and precise measures. Under the influence of socio-ecological theory, the focus shifted gradually from single factors toward a comprehensive approach (Lezotte, 1991), in which micro-processes are considered as interacting with factors from the meso- and macro-level of the system (Bronfenbrenner, 1978, 1981).

The need for explanations of effective and less effective schools accompanied the paradigm shift from input-output to input-process-output, gradually moving teaching and learning as reciprocal processes into the focus of research on student learning. While Carroll assumed that teaching quality is a stable construct expected to produce positive results relatively independently of students' traits and motivation, subsequent models were based on the assumption that positive learning outcomes emerge as a function of the interplay between individual and teaching factors. For instance, based on the principle of consistency, Slavin assumed mutual dependence of the factors considered as quality indicators in his QAIT model (quality of instruction, appropriate levels of instruction, incentives, time/opportunity to learn) (Slavin, 1987, 1996).

Slavin's QAIT model was further elaborated in the context of effective classroom research by Squires and colleagues (Squires et al., 1983) and by Huitt (2003, 1997a, 1997b) in the transactional model of the teaching/learning process (see also Huitt et al., 2009). The model expanded the original QAIT model by (a) additional measures of learning outcomes, such as self-efficacy and self-regulation aside from achievement or basic skills; (b) teacher characteristics (e.g., self-efficacy); (c) context factors (e.g., school characteristics); (d) factors of the learning and teaching processes; and (e) characteristics of the home environment. Planning (e.g., getting ready for classroom activity) and management (e.g., getting control of the classroom) are included as specific categories in Huitt's model, whereas they were only implicitly addressed in both Carroll's and Slavin's models. Stringfield and Slavin (1992) extended Slavin's model to the QAIT-MACRO model (meaningful goals, attention to daily functioning, coordination, recruitment of teachers, organization), including the institutional level, however without specifying explicitly the nature of the relationships between the levels. According to the authors, effective school learning, labeled as "mastery learning," occurs in settings where factors from several levels are aligned.


Extensions from the Socio-Ecological Perspective

Models describing student learning have emerged from various research traditions and have been developed over time and, at first, independently of each other. On the one hand, studies within school effects research were mainly based on quantitative applications at the organizational (school) level, in which short-term outputs were perceived as effects of inputs. On the other hand, effective school research emerged from qualitative psychological research around teaching and learning processes and outcomes within the process-output perspective. Models focusing on school processes and environmental factors causing change emerged within school improvement research.

Theoretical developments within the field of school learning research, in particular the contribution of socio-ecological theory (Bronfenbrenner, 1978, 1981), initiated a gradual fusion of the previously separated research strands from the 1980s on. It took several decades of research and scientific discourse until a more comprehensive and elaborated approach was established and subsumed under the school effectiveness research (SER) framework. SER as an overarching concept encompassed major strands in the field of educational research on students' learning in school; however, it did not claim to embrace the entire range of research in this area (Reynolds & Teddlie, 2000). Another important research strand that emerged as a consequence of these theoretical developments was educational effectiveness research (EER). Research within this area originated in two previously separated research traditions: teacher effectiveness and school effects. Models developed within this framework typically addressed context factors (predominantly school factors) that influence student achievement, mediated by classroom and teacher characteristics (Creemers & Kyriakidēs, 2008b; Creemers & Reezigt, 1996; Stringfield & Slavin, 1992).

Despite this more comprehensive perspective on learning processes encompassing multiple levels and factors, empirical research on student learning in the 1980s and 1990s was still often conducted within a particular research tradition (e.g., school effectiveness, school improvement, or teacher effectiveness), relying on a specific theoretical framing and specific empirical methods. Mixed-method approaches and comprehensive empirical investigations were applied only occasionally. Together with digitalization, from the 1990s on, cross-sectional and international research evolved rapidly, enhancing collaboration possibilities among researchers from different fields. These developments contributed substantially to the growing understanding of the need to combine diverse views on school learning and to aggregate knowledge generated from theories and methodologies of different traditions (psychometrics and statistics as well as qualitative methods) to enable reliable results, in-depth discoveries, and valid interpretations.

Under the influence of these developments, school learning models gained complexity and became more nuanced with respect to the description of indicators of process quality, stressing the necessity of mixed-method approaches. For instance, although previous models (e.g., the QAIT-MACRO model) displayed multiple levels of the system, they did not specify relationships between factors.


In contrast, the comprehensive model of educational effectiveness developed by Creemers (1994) in the context of effective classroom research assumes reciprocal dependency of factors within and across classrooms as well as between the classroom and the school level. The author developed a comprehensive framework referring to the principles of consistency, cohesion, control, and constancy, which provide a frame to assess the quality of the process. Another aspect pertains to the definition of instructional quality, considered a major dimension in Carroll's model. In Creemers' model, this dimension was further differentiated into curriculum, grouping, and teacher behavior (Creemers, 1994, p. 47; see also Creemers & Kyriakidēs, 2008b). Teaching and instructional quality can be assessed in this framework by comparing the level of consistency between the intended, implemented, and achieved curriculum (Schmidt et al., 1996; Schmidt & Cogan, 1996). The underlying assumption is that teaching and instructional effectiveness is influenced by its quality, mediated by learning opportunities and considering individual student characteristics (motivation, ability, time spent on task). In his model, Creemers stresses the interdependency of school and classroom factors, which are related to instruction, learning opportunities, and available time.

While Creemers' model focuses on processes within the classroom, Scheerens and Bosker (1997) put equal emphasis in their model on both classroom- and school-level factors. The authors' integrated multilevel educational effectiveness model is embedded into the input-process-output framework and combines instructional quality with the constructivist approach, while the school system is described based on economic theories. Similar to Creemers' comprehensive model of educational effectiveness, Scheerens and Bosker assume that factors on different levels interact in a dynamic and reciprocal way.
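To illustrate how such multilevel models are commonly operationalized when analyzing ILSA data (an illustrative sketch only, not Scheerens and Bosker's own specification; the predictor names are hypothetical placeholders), student outcomes can be written as a two-level random-intercept regression:

\[
Y_{ij} = \beta_{0j} + \beta_{1}\,\mathrm{OTL}_{ij} + \varepsilon_{ij},
\qquad
\beta_{0j} = \gamma_{00} + \gamma_{01}\,\mathrm{InstrQual}_{j} + u_{j},
\]

where \(Y_{ij}\) is the achievement of student \(i\) in classroom (or school) \(j\), \(\mathrm{OTL}_{ij}\) is a student-level measure of opportunity to learn, \(\mathrm{InstrQual}_{j}\) is a classroom-level indicator of instructional quality, and \(\varepsilon_{ij}\) and \(u_{j}\) are student- and classroom-level residuals. Such a specification captures the hierarchical embedding of students in classrooms and schools, although the reciprocal and dynamic relationships postulated by these models typically require longitudinal or structural extensions.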

Extensions Reflecting the Dynamic Perspectives

Recent globalization tendencies and the internationalization of the education discourse require increasingly generic approaches that can be applied in diverse contexts, providing detailed and precise measures, enabling valid interpretations, and yielding useful policy recommendations (most commonly at the national level). To meet these expectations, recent models of school learning are usually comprehensive, multilevel, and multifaceted, and assume a dynamic and reciprocal nature of interactions within the school system (e.g., the model from Creemers & Kyriakidēs, 2008b). Moreover, due to criticism of earlier research on school quality and effectiveness, the concept was broadened to encompass not only student characteristics aside from achievement, such as student traits related to learning, but also other system outcomes such as equality of opportunity (Kyriakides et al., 2018; Nachbauer & Kyriakides, 2019; Stancel-Piątak, 2017). Notably, recent publications tend to refer more often to EER, rather than to SER, as the overarching concept encompassing present research around school learning (e.g., Reynolds et al., 2014; Scheerens, 2017). This mirrors the shift of focus from the school system toward teaching and learning processes and the actors involved, i.e., students and teachers.


An increasing number of studies underline the importance of the core process (i.e., learning and teaching) within the school and the significance of factors proximal to learning, while the school, the area the school is located in, and the education policy context are considered contextual factors distal to student learning (Hanushek, 2016; Scheerens, 2017; Seidel & Shavelson, 2007). While this perception is not novel, the particular research (and policy) focus on the concept of effectiveness as being dominated by process quality (as expressed in the currently preferred term of educational effectiveness), rather than by other system characteristics, has become apparent predominantly in more recent publications (e.g., Reynolds et al., 2014; Teddlie & Reynolds, 2000). The paradigm shift is not only supported by empirical findings but has also been conditioned by the fact that the actors (students and teachers), rather than structural elements of the system, have been identified as more important (Hattie, 2009).

One prominent example is the dynamic model of educational effectiveness from Creemers and Kyriakides (2008b; the model is also presented in more detail in Chap. 13-1 of this book), which addresses the need for overarching concepts while being anchored in the tradition of previous models of school learning within EER. Similar to its predecessors, the model recognizes that the teaching and learning processes that are at the core of interest take place within classrooms and are influenced by school factors and by the larger regional context. The respective levels (student, classroom, school, and context) are directly embedded into the model structure. Further, the model specifies reciprocal relationships between factors at different levels and assumes direct and indirect effects of the contextual factors on student learning. The extent to which the model implements the OTL concept is unique: not only does it focus on teacher characteristics that are significant for the opportunity to learn, but it also includes school- and context-level concepts important for learning opportunities (Creemers & Kyriakidēs, 2008a, 2008b). Compared to its predecessors, the model includes more elaborated and nuanced measures of school and contextual factors.

The underlying assumption is that system variables can improve school quality over time and that their evaluation has to consider environmental factors. With this, the model refers to the idea of contingency, according to which adaptation and responsiveness to external conditions and to the specific needs of the students are crucial. Evaluation, which is perceived as important for policy development and decision making, should accordingly focus on the extent to which the particular needs of the school clients (students) are addressed appropriately and whether this has contributed to increasing the effectiveness of teaching and learning. An interesting feature of the model is the assumption of nonlinear relationships between teaching, learning, and student outcomes. For instance, the authors assume that some of the factors that initially improve student learning can have a negative effect after reaching an optimum (resulting in an inverted U-curve). Yet another progressive assumption of the model pertains to the inclusion of differential analysis, implying that various characteristics of teaching and the system can have differential effects on different groups of students. Differential effects are evaluated on five dimensions: quantity, focus, stage, frequency, and differentiation.


An implication of the contingency idea is that educational outcomes can only be improved if there is an overarching strategy that includes different levels and areas of the system. In order to capture the diversity of policy action, the model was further extended by the factor "policy in action" as one of the system monitoring variables (Creemers & Kyriakides, 2012; Kyriakides & Demetriou, 2010).
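To make the nonlinearity assumption described above concrete (an illustrative sketch only, not the authors' own formalization), the postulated optimum can be represented by adding a quadratic term for a teaching or system factor \(X\):

\[
Y = \beta_{0} + \beta_{1}X + \beta_{2}X^{2} + \varepsilon,
\qquad \beta_{1} > 0, \ \beta_{2} < 0,
\]

so that the expected outcome increases with \(X\) up to the optimum \(X^{*} = -\beta_{1}/(2\beta_{2})\) and declines thereafter. The differential-effects assumption can analogously be represented by allowing \(\beta_{1}\) and \(\beta_{2}\) to vary across groups of students, for example through interaction terms with group membership.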

Models of School Learning in ILSA Frameworks

As previously discussed, the theoretical developments in models of school learning were manifold and emerged from diverse traditions. ILSAs, being part of this process (mainly within the school effectiveness strand), have had a major impact on these developments, and have themselves developed significantly under the influence of research in this area. In particular, the OTL concept was implemented in ILSAs at a very early stage and has been further developed over subsequent cycles. Recently, ILSAs have been an integral part of EER and of research on school learning in general. In this section, the frameworks of major ILSAs are reviewed with respect to the following criteria:

1. State-of-the-art theories: The extent to which ILSA frameworks are based on established theories on student learning in education systems, considering, in particular, the following aspects:
(a) Systematic and comprehensive approach to the description of the school system: Different system levels are reflected and considered as hierarchical, and a systematic, theory-based selection of the relevant factors is displayed.
(b) Comprehensive approach to the analysis of the outcomes: Reflected in the inclusion of outcomes other than cognitive ones (e.g., noncognitive student characteristics, equality).
(c) Dynamic approach: The dynamic and reciprocal nature of relationships between different factors of the school system is acknowledged.
2. Empirical foundation: The extent to which hypotheses are derived from prior empirical findings.
3. Regular updates by new research findings and refinement for the cyclical studies: The extent to which each study framework has been enhanced by recent empirical research and refined for each cycle.

Notably, the review of the general study frameworks will mainly focus on the extent to which they explicitly reflect the theoretical frame provided by research on student learning in general and by models of educational effectiveness in particular, as described above. Theoretical assessment frameworks for the assessment domains (reading, mathematics, science, etc.) will not be addressed.

Early IEA studies were precursors of the current ILSAs. In this section, the respective developments are briefly described from a historical perspective (see also Papanastasiou et al., 2011, for more details).


The review begins with the early IEA studies and continues with the subsequent IEA and OECD ILSAs conducted regularly from the 1990s on. Needless to say, this overview is not comprehensive. Apart from regional large-scale assessments, which will not be addressed here (PRIDI: Programa Regional de Indicadores de Desarrollo Infantil; PASEC: Programme d'analyse des systèmes éducatifs de la CONFEMEN; and TERCE: Tercer Estudio Regional Comparativo y Explicativo), there are other ILSAs conducted by various organizations. Such studies are, for instance, the cyclical IEA assessments of computer literacy: COMPED 1989 and 1992 (Computers in Education Study), SITES 1998–1999 (Second Information Technology in Education Study), and ICILS 2013 and 2018 (International Computer and Information Literacy Study). The assessment of computer literacy can be traced back to SIMS (Second International Mathematics Study; Husén 1967a, b; Postlethwaite 1967). COMPED and SITES were forerunners of ICILS, which has been cyclical from 2013 on. Other cyclical studies assess civic education: CIVED 1999 (Civic Education Study) and ICCS (International Civic and Citizenship Education Study, cyclical from 2009). The assessment of civic education was based on prior work in the scope of the Six-Subject Study with its component on civic education (Bloom 1969). Another outstanding ILSA is the Teacher Education and Development Study in Mathematics (TEDS-M 2007). TEDS-M examined how different countries prepare their teachers to teach mathematics in primary and lower-secondary schools. The study gathered information on various characteristics of teacher education institutions, programs, and curricula. It also collected information on the opportunities to learn within these contexts, and on future teachers' knowledge and beliefs about mathematics and learning mathematics (Tatto et al., 2008). These studies are described in more detail in other chapters of this book. Other studies conducted by the OECD, such as PIAAC (Programme for the International Assessment of Adult Competencies), also remain unconsidered, as their content exceeds the scope of this handbook.

The Beginning: Pilot Twelve-Country Study, Early ILSAs (FIMS, SIMS, IRLS), and IEA's Foundational Curriculum Model

Pilot Twelve-Country Study (1960)

The very first international study conducted on a large scale by IEA was the Pilot Twelve-Country Study (Husén, 1967a, 1967b; UNESCO Institute for Education, 1962), which assessed the mathematics, science, reading comprehension, geography, and nonverbal ability of 13-year-old students. Subsequent studies have drawn both on the results of this study and on the experience gathered in the scope of its development and implementation. The study's major contributions to empirical educational research and comparative education were considered to be (a) the extended range of research and (b) the inclusion of empirical methods. The implemented analytic approach was perceived as an advancement over the initially descriptive and juxtaposing approaches (Husén, 1967a, p. 26).


The possibility to compare various "series of environments in which human beings learn" through cross-country comparisons was assumed to provide a kind of "laboratory" situation "in which many of the more profound questions concerning human growth can be studied objectively," while acknowledging the challenging "task of specifying such environments with reasonable accuracy and in comparable and meaningful ways" (Husén, 1967a, p. 27). In this perspective, the world was viewed as a laboratory consisting of countries with their specific environments, constituting a situation equivalent to a natural experiment in which different education systems could be compared under controlled conditions. At the time the pilot study was conducted, it was a quite innovative attempt to develop and administer instruments with the goal of producing internationally comparable and meaningful results. In this respect, being a feasibility study, the pilot study fulfilled its goals, while revealing on the one hand the great interest of the participating national research centers (responsible for study coordination and implementation at the national level) and on the other hand the challenges related to an international empirical study of this scope.

The documents provided for this early pilot study reveal that it was quite common to view education systems within the input-output framework. Student achievement, more precisely students' "intellectual functioning and attainment," was at the core of interest of the pilot study. It was perceived as an "output" or "product" of educational systems across the world, about which information should be obtained to provide "the major missing link in comparative education" pointed out by Anderson (1961, pp. 7–8). Interestingly, the authors of this very first implementation of a cross-national comparison were concerned, from the planning stages, with the public perception of the reported results, in particular with the use of country results for mean comparisons. Being aware of the boundaries of such types of international comparisons, it was stressed that the goal should be to "discern patterns of intellectual functioning and attainment in certain basic subjects of the school curriculum under varying conditions" rather than to "evaluate educational performances under different educational systems in absolute terms" (UNESCO Institute for Education, 1962, p. 5).

FIMS: First International Mathematics Study (1964)

Following the Pilot Twelve-Country Study, IEA conducted the First International Mathematics Study (FIMS, 1964; Husén, 1967a, b; Postlethwaite, 1967; Noonan, 1976), which was later followed by the Second International Mathematics Study (SIMS, 1980–82; Rosier & Keeves, 1991; Postlethwaite & Wiley, 1992). The FIMS populations were 13-year-old students and preuniversity students. Based on the experience gained through the Pilot Twelve-Country Study, more rigorous sampling procedures were implemented in the scope of FIMS. The instruments were pretested and modified extensively, and more standardized procedures for test development were implemented. Moreover, efforts were undertaken to provide theoretical foundations for the study by selecting experts for specific topics, allocating a graduate student to support the scientist in charge of the literature review and the analysis, as well as by conducting a meeting devoted to hypothesis development.


Two major study-related publications were released in the same year: the two-volume final report "International Study of Achievement in Mathematics" edited by Torsten Husén (1967a, b) and a study by Neville Postlethwaite titled "School Organization and Student Achievement" (Postlethwaite, 1967). In the former, the importance of suitable interpretations of the results was stressed, as it was in the pilot study. The authors emphasize that "the primary interest was not on national means and dispersions in school achievements at certain age or school levels." Comparisons of national levels of subject matter performance are considered "meaningless" (Husén, 1967a, p. 30). Instead, it is stressed that the major interest is to compare "input" factors (at the national, school, and individual level) that might be relevant for student learning. The interest was in the "'outcomes' of various school systems," while considering education as part of a larger social-political-philosophical system (Husén, 1967a, pp. 30, 65, 69). The added value of an international compared to a national assessment was seen in its contribution toward (1) a deeper understanding of how "educational productivity" is related to instruction and societal factors; (2) shedding "new light upon the importance of the school structure" and the way it "mirrors influences from the society"; and (3) the diverse country-specific methods of teaching mathematics and the place of mathematics within each national curriculum (Husén, 1967a, p. 31). Thus, major attempts were undertaken to develop measures comparable across countries to describe both the inputs and the outputs through suitable sampling and assessment procedures (Husén, 1967a, p. 32).

The theoretical foundations presented in the final report developed over 4 years from the work of a group of experts from the various countries participating in the study. The group developed a set of hypotheses, split into three categories: (1) school organization, selection, and differentiation; (2) curriculum and methods of instruction; and (3) sociological, technological, and economic characteristics of families, schools, or societies (Husén, 1967a, p. 32). The conceptual scheme was further elaborated in a cooperative effort between the Hypothesis Committee and the study council. Considering the larger social-political-philosophical system, a conceptual scheme was provided encompassing six research areas: value and philosophy; policy (including education policy and the broader policy context); educational practices; cognitive learning outcomes; affective learning outcomes; and general attitudes and values (Husén, 1967a, p. 70). Notably, these areas are very broad, with some of them being only implicitly related to student outcomes or processes within school. In particular, this pertains to value and philosophy and to the broader policy context. Also, the measurement of variables on school organization, curriculum, instruction, and societal factors was critically viewed as "rather crude," and one major aim was to identify the most important factors and, in consequence, to narrow the scope and to develop refined measures in upcoming projects (Husén, 1967a, p. 32). In the second volume of the final report, the theoretical frameworks are presented, followed by recent empirical findings for each reported topic. However, the authors themselves acknowledge that while some of the areas "can be studied in a relatively sophisticated way, such as school achievement," other areas "can be described only in common sense terms, such as the elements of national policy" (Husén, 1967a, p. 65) or school organization (Husén, 1967b, p. 56). Thus, the results should be interpreted with caution.
The theoretical framing for the analysis of achievement as well as of interests and attitudes refers to inquiry-based learning techniques implying active participation of students, versus the more traditional approaches based on repetition and memorization. OTL was analyzed, on the one hand, using information gathered via a teacher questionnaire focused at the classroom level (i.e., teachers' perceptions of the students' opportunity to learn the mathematics involved in the test items) and including an indirect measure of the respective education policy (i.e., teachers' perceptions of the national emphasis on each assessed topic in each school program). Moreover, the total time on schooling was calculated using information from the student questionnaire on the opportunities given by (a) the time spent on school work, (b) homework, and (c) special opportunities provided by schools to some students, e.g., lectures, mathematics clubs, and individual extra work. Mathematics instruction was considered a very important confounder of the relationship between time spent on learning and student achievement.

It is justified to state that the theoretical conceptualization of FIMS, although embedded into the input-output paradigm, to some extent outpaced the theoretical developments of that time. This pertains in particular to the comprehensive perspective and the inclusion of the national context into the theoretical conceptualization of student learning at school. As described earlier in this chapter, it was not until the 1980s that the ecological perspective began to influence learning and teaching theories on a wider scope, which resulted in the inclusion of several context levels into models of school learning (e.g., Creemers & Kyriakidēs, 2008b; Creemers & Reezigt, 1996; Stringfield & Slavin, 1992). Also notable is that already in this very first implementation of an ILSA, the "output" variables were not limited to cognitive outcomes (i.e., mathematics) but were supplemented by information gathered on student attitudes toward education. The study-related publications reveal that the authors recognized the limitation that the only subject assessed was mathematics. As the final report for FIMS was being produced, the Six-Subject Study was already in the planning phase, and the expectations of the knowledge to be generated from this upcoming study were high (Husén, 1967a, p. 15).

Six-Subject Study and FISS (First International Science Study; 1970–71)

The Six-Subject Study (Bloom, 1969) included assessments of literature education and reading comprehension, which laid the foundations for the subsequent International Reading Literacy Study (IRLS, 1990–91) and the Progress in International Reading Literacy Study (PIRLS, cyclical from 2001 on). Moreover, science assessment (First International Science Study, FISS) was also included as a component of the Six-Subject Study and was followed up in the scope of the Second International Science Study (SISS, 1983–84). The component on civic education provided foundations for the subsequent Civic Education Study (CIVED, 1999) and the following International Civic and Citizenship Education Study (ICCS, cyclical from 2009). Further assessments included English as a foreign language and French as a foreign language. The Six-Subject Study included 10-year-old students, 14-year-old students, and students in the final year of secondary school.

The research aimed at extending the knowledge gained from the Pilot Twelve-Country Study and FIMS, focusing its efforts on (1) providing a strong and detailed theoretical framework and (2) establishing standardized procedures for instrument development, which would allow for a coordinated contribution from each of the participating education systems (Bloom, 1969, pp. 9–1, 12–2). To provide detailed and robust concepts, an aim of the preparation meetings was to bring outstanding scholars from various social sciences together “to review the type of research being undertaken, and at the same time to suggest from their own discipline point of view the type of hypotheses (. . .) together with the types of variables (. . .)” (Bloom, 1969, pp. 1–10). The results of this meeting were passed to the scientists responsible for questionnaire development. Aside from developments to the conceptual framework, major efforts were undertaken to strengthen the cross-national validity of test instruments by establishing expert groups in each country (approximately 20), developing national expert papers and international specifications (including reviews at the national level), and empirically validating the test material. The study documentation indicates that although the inability to test causal hypotheses with cross-sectional data is acknowledged, associations are nevertheless perceived at least as an indication of a possible effect; thus, the overall goal of the study is described as extracting “malleable factors,” which “have substantial effect” on the outcomes of students (Bloom, 1969, pp. 12–3). As in FIMS, the hypotheses in the Six-Subject Study are split into different categories. These categories, however, more closely mirror the actual structure of the school system and the system levels commonly defined in recent ILSAs:

1. The social-political context, including home and community and the educational policy of the system
2. The school context, including school factors
3. The classroom learning environment, including curriculum and instruction, teacher characteristics, and student characteristics

The education system is not only perceived in the context of the larger social-political-philosophical system, as already described in FIMS (Husén, 1967a), but is regarded as a “sub-system” of the social system. With this, the hierarchical nature of the different levels of the system is not only implicitly embedded into the theoretical framing and interpretation in the Six-Subject Study, but explicitly conceptualized as such (Bloom, 1969, pp. 9–2). The hierarchical structure is further acknowledged through level-specific analysis: “between students within schools,” “between schools,” or “between national systems.” Following the approach in FIMS, the outputs of the system in the Six-Subject Study include attitudes and skills beyond the specific content knowledge. Carroll’s model substantially contributed to the theoretical framing, enabling the systematic inclusion of measures for all variables relevant for student learning (Carroll, 1975, p. 42). The “quality of opportunity” is explicitly included as an input variable.

In the study report, the questionnaire development process is summarized for each subject, and the theoretical foundations discussed at the meetings mentioned earlier are further elaborated (Bloom, 1969, pp. 9–1). Also, the challenges of item construction, along with the strategies chosen to overcome them, are presented in more detail. For each subject matter, detailed hypotheses were developed (Bloom, 1969, Appendix I). Further publications, which followed over the subsequent decade, provided in-depth analyses based on the data collected in the scope of the Six-Subject Study.

SIMS: Second International Mathematics Study (1980–82) and SISS: Second International Science Study (1983–84)

The Second International Mathematics Study (SIMS, 1980–82; Postlethwaite & Wiley, 1992; Rosier & Keeves, 1991), as well as the precursory FIMS, was among the most influential early studies assessing mathematics. In line with FIMS, the core population of SIMS consisted of 13-year-old students and preuniversity students. Along with the science assessments FISS and SISS, the mathematics assessments FIMS and SIMS provided foundations for the science and mathematics assessments in the Third International Mathematics and Science Study (TIMSS 1995), which has since been conducted cyclically (TIMSS 1995: Third International Mathematics and Science Study; TIMSS 1999: Third International Mathematics and Science Study – Repeat; TIMSS 2003 and subsequent cycles: Trends in International Mathematics and Science Study). Further, the assessment of computer literacy can be traced back to SIMS, in which students’ skills in computer science were assessed. The topic was then followed up in the scope of the Computers in Education Study (COMPED, 1989, 1992), followed by the Second Information Technology in Education Study (SITES, 1998–1999) and the International Computer and Information Literacy Study (ICILS, cyclical from 2013 on). The assessment framework of SIMS was based on the OTL concept, which implied that the curriculum has three manifestations: what society would like to see taught (the intended curriculum), what is actually taught in the classroom (the implemented curriculum), and what the students learn (the attained curriculum) (Travers & Westbury, 1989). The curriculum was perceived as a broad explanatory factor underlying student achievement. Building upon this concept, the framework was extended to embrace a definition of the curricular context and curricular antecedents for each of the three curricula. Thus, the model incorporates the hierarchical and multidimensional structure of the education system, providing a comprehensive frame for the study as well as for the data analysis and interpretation. Implemented as a follow-up of FISS, the Second International Science Study (SISS, 1983–84; Anderson, 1989; Jacobson & Doran, 1988; IEA, 1988) assessed the science achievement of students aged 10 and 14, and of science students in the final year of secondary education. Analogous to the mathematics studies, SISS was grounded in its precursor (FISS) and made substantial progress in several respects, including the conceptual framing. Thus, in this study too, the links to the concept of OTL were made more prominent, with explicit reference to the intended, implemented, and achieved curriculum (Jacobson & Doran, 1988, p. 19). Building upon the FISS framework (Keeves, 1974), the
so-called IEA foundational curriculum model was further extended and sharpened in SISS (Rosier & Keeves, 1991, p. 5). To collect information on the intended curriculum, the science curricula (their aims and objectives) of the participating countries and their coverage in textbooks, syllabi, and reference materials were reviewed and used for the development of the assessment. The implemented curriculum was assessed via a teacher questionnaire in which teachers were asked to indicate the coverage of the content included in the actual science assessment. Finally, the achieved curriculum was represented by the scores on the achievement tests. In order to gather in-depth information on factors that may affect science achievement, a case study was conducted for each participating country, collecting information on organization and funding, types of schools and examinations, aims and objectives, approaches to curriculum development, the content of teacher education, etc. (Jacobson & Doran, 1988, p. 20).
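The implemented-curriculum measure described above lends itself to a simple operationalization. The sketch below, which uses hypothetical variable and column names, illustrates one way teacher-reported topic coverage could be aggregated into a classroom-level OTL index; it is an illustration of the general idea, not the scoring procedure actually used in SISS.

```python
import pandas as pd

# Hypothetical teacher responses: one row per (class, topic);
# "covered" is coded 1 if the tested topic was taught, 0 otherwise.
responses = pd.DataFrame({
    "class_id": ["c1", "c1", "c1", "c2", "c2", "c2"],
    "topic":    ["t1", "t2", "t3", "t1", "t2", "t3"],
    "covered":  [1, 1, 0, 1, 0, 0],
})

# A simple OTL index: the proportion of tested topics the teacher
# reports having covered in that classroom.
otl_index = responses.groupby("class_id")["covered"].mean().rename("otl_index")
print(otl_index)
# c1 -> 0.67, c2 -> 0.33
```

In secondary analyses, such an index could then be related to classroom-level mean achievement on the corresponding test.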

Subsequent ILSAs: IEA’s TIMSS and PIRLS and OECD’s PISA and TALIS

TIMSS: Third International Mathematics and Science Study (1995) and Trends in International Mathematics and Science Study (4-Year Cycle from 1995 on)

Early IEA ILSAs laid the foundations for the subsequent ILSAs conducted from the late 1990s on until today. The first cycle of TIMSS (Martin & Kelly, 1996) can be seen as the most ground-breaking of these. The methodology applied in this study was implemented on a large scale (45 participating education systems) using rigorous standardized procedures, a rotated design, and sophisticated analysis methods. Since 1995, TIMSS has been conducted periodically (1995, 1999, 2003, 2007, 2011, 2015) with a slightly changed name from the third cycle on. While in earlier cycles the study assessed the achievement of third or fourth and seventh or eighth grade students in mathematics and science, it has focused on fourth and eighth grade students in recent cycles. TIMSS Advanced (2008 and 2015) extended the assessment framework to include tasks suitable for assessing students with special preparation in mathematics and physics in the last year of upper secondary school. The construction of the overarching conceptual model and the instrument development in TIMSS 1995 was the main task of a preceding small-scale international project called the Survey of Mathematics and Science Opportunities (SMSO), funded by the National Science Foundation and the US National Center for Education Statistics. Following the approach established in SIMS (Rosier & Keeves, 1991; Postlethwaite & Wiley, 1992), the OTL concept was put at the core of the study (see Schmidt et al., 1996; Schmidt & Cogan, 1996). Accordingly, the curriculum was perceived as an explanatory factor underlying student achievement while distinguishing the intended, implemented, and attained curriculum from each other (Martin & Kelly, 1996, pp. 1–4). In the TIMSS 1995 theoretical foundation, the OTL concept was extended to educational opportunity, encompassing various factors at different levels of the system (education policy, school, classroom, and student) assumed to create a learning environment in which educational opportunity is provided to students
(Schmidt & Cogan, 1996). The school-related concepts were based on the indicator model of school processes (Porter, 1991), a modified version of the model of Shavelson et al. (1987), which describes school processes within the input-process-output paradigm. Factors that influence instructional practices were based on respective research reviews (Prawat, 1989a, 1989b). Student characteristics were derived from literature reviews conducted by an expert group. The link between the theoretical framing and the instruments is provided in a very clear manner in the Technical Report (Schmidt & Cogan, 1996, pp. 5–10). The system, school, classroom, and student characteristics are organized within a comprehensive conceptual framework, educational experience opportunity, along four dimensions that can be summarized as: (1) learning goals, (2) teacher characteristics, (3) organization, and (4) learning outcomes (Martin & Kelly, 1996). The educational system is perceived as a multifaceted system embedded in the larger societal system. Relationships between factors at different levels of the system are specified as reciprocal as well as unidirectional. Learning opportunities and experiences of students are assumed to be influenced by these factors, which interplay with each other and are guided by the national curriculum, which may emphasize certain opportunities to learn and constrain others. The curriculum framework presented in the TIMSS 1995 Technical Report has been briefly summarized in the documents of the subsequent cycles. For instance, the TIMSS 1999 Technical Report briefly mentions different areas of the school system (Martin et al., 2000, p. 306). The report provides a list in which the system, school, teacher, and student levels are considered along with the curricular context of students’ learning and instructional organization and activities. Although the variables are modified for each cycle to capture societal changes, the overall concept is based on the previously developed model of educational experience opportunity (Martin & Kelly, 1996). To enable trend comparisons, modifications to the measurement instruments are kept limited to the most necessary changes.

IRLS: International Reading Literacy Study (1990–91) and PIRLS: Progress in International Reading Literacy Study (5-Year Cycle from 2001 on)

IRLS (Elley, 1993, 1994) was a forerunner of PIRLS (cyclical from 2001 on; Binkley et al., 1996). The core populations were 9-year-old and 14-year-old students. The study was framed around the concept of effective schools, a concept that relies on the basic assumption that a school can be perceived as “effective” if the achievement of its students is higher than the average of schools that operate under similar conditions related, for instance, to student composition and available resources (Postlethwaite & Ross, 1992). Several levels and areas of the school system are listed in the framework, starting from the student level through teacher and class characteristics up to the level of the school and the surrounding area. While the list is quite comprehensive, an overarching structure is not the focus of the framework. Clearly, the stress is on the development of an international assessment of reading comprehension and on analyzing it in relation to relevant school, classroom, and teacher characteristics.
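The effective-schools notion sketched above is often operationalized as a residual: a school is judged effective to the extent that its mean achievement exceeds what would be predicted from its conditions. The following minimal sketch uses synthetic data and hypothetical variable names to illustrate this idea; it is not the analysis conducted in IRLS.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_schools = 200

# Synthetic school-level data: mean SES of the intake, resources, and
# a mean achievement score that partly depends on both.
schools = pd.DataFrame({
    "mean_ses": rng.normal(0, 1, n_schools),
    "resources": rng.normal(0, 1, n_schools),
})
schools["mean_achievement"] = (
    500 + 30 * schools["mean_ses"] + 10 * schools["resources"]
    + rng.normal(0, 15, n_schools)
)

# Predict achievement from the schools' conditions ...
model = smf.ols("mean_achievement ~ mean_ses + resources", data=schools).fit()

# ... and treat the residual as a crude "effectiveness" indicator:
# positive values mean achieving above expectation under similar conditions.
schools["effectiveness"] = model.resid
print(schools.sort_values("effectiveness", ascending=False).head())
```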

PIRLS builds upon IRLS but takes the assessment further, implementing rigorous standardized methods and a comprehensive approach to student learning in a similar manner to TIMSS 1995. PIRLS has been implemented since 2001 in a 5-year cycle (2001, 2006, 2011, 2016, 2021) with a core population of fourth grade students. In the last two cycles, the PIRLS test was extended to enable more precise measurement of low-achieving children (PIRLS-E). Notably, the study frameworks for PIRLS are published before the implementation of each cycle, usually at least 1 or 2 years ahead of the data collection. For example, the framework for PIRLS 2021 was already published in 2019. Aside from the theoretical foundation for the assessed domains of reading literacy, the framework briefly presents the contextual factors assessed via student, parent, teacher, and school questionnaires. While the older frameworks (of the 2001, 2006, and 2011 cycles) provide a theoretical conceptualization that mirrors, in a simplified manner, the models from TIMSS 1995, the newer frameworks abstain from further explorations or extensions. The model presented in the first three cycles (2001, 2006, and 2011) reflects the idea of student learning being embedded in the learning context and being influenced by the classroom and teacher, school factors, and the wider community context. The description of the relationships between those factors mirrors the kind of inferences implied by the input-process-output paradigm. Interestingly, while the 2001 and 2006 cycle frameworks acknowledge the reciprocal nature of effects between various factors of student learning and student outcomes, the 2011 cycle framework implies a unidirectional influence of schooling factors on student achievement, while the interactions between home, school, and classroom are still assumed to be of a reciprocal nature. From its second cycle onward, PIRLS included noncognitive outcomes (student attitudes and behaviors), extending the concept of school quality or effectiveness to encompass outcomes other than cognitive ones.

PISA: Programme for International Student Assessment (3-Year Cycle from 2000 on)

In terms of the scope of assessed domains of student learning, PISA is the most comprehensive OECD study, conducted in a 3-year cycle from 2000 onward. It encompasses reading, mathematics, and science literacy in each cycle, with additional skills specific to each cycle (e.g., global competence in 2018). The frameworks of each cycle are organized around the assessed domains, which were the core focus of previous PISA cycles. They provide a rationale and, over the cycles, an increasingly comprehensive and elaborated scientific foundation for each of the assessed domains. Comparing the study frameworks from the seven cycles implemented until 2018, it can be noted that the core interest gradually shifted from a sole focus on the achievement domains toward a more comprehensive approach to factors related to student learning, including noncognitive student outcomes such as attitudes, motivation, and dispositions. In this respect, the developments reflect the overall developments within SER and EER described at the beginning of this chapter. The framework of the 2018 cycle provides a comprehensive and elaborated overview of the system levels (teacher/classroom, school, and education policy) and respective
factors. The factors are organized according to the input-process-output paradigm (student background, learning process in context, outcomes). Notably, the 2000–2006 cycle frameworks contain very little reference to the contexts of student learning, while this topic received increasing attention from 2012 on. An explicit reference to the input-process-output paradigm can be found in the frameworks from the 2012 cycle onward. While the complex and reciprocal nature of relationships between different system factors is acknowledged, the organization of learning factors around the input-process-output paradigm implicitly indicates causality. Thus, in the 2018 framework, the study’s authors replaced the input-process-output-oriented nomenclature with less causally oriented terms. The most telling example pertains to the replacement of the category “process” in Fig. 6.2 in 2015 and Fig. 6.1 in 2012 (OECD, 2017; PISA 2012 assessment and analytical framework, 2013) by “schooling constructs” in 2018 (OECD, 2019). Along with these developments, it can also be noted that the study frameworks developed with respect to the definition, scientific foundation, and elaboration of measures of the background and contextual factors of student learning. Despite this, the factors are organized and structured around policy issues and the instruments rather than around a specific theoretical concept that would allow for an elaborated description of the school system (OECD, 2019). The system- and individual-level factors are a selection of factors considered relevant for student learning. Their relevance is based on policy interest and expert opinions, supplemented by an overview of selected research findings.

TALIS: Teaching and Learning International Survey (5-Year Cycle from 2008 on)

TALIS collects information about teachers, teaching conditions, and learning environments in a 5-year cycle (2008, 2013, 2018). The core population consists of teachers of ISCED (International Standard Classification of Education) level 2 students, with ISCED level 1 and 3 teachers included in some countries. TALIS, along with the Starting Strong Survey, is part of a larger TALIS program, which involves activities such as initial teacher preparation, a teacher knowledge survey, and a video study. The study’s framework aims at providing “TALIS 2018 with an integrated theoretical and policy underpinning that articulates the study’s research focus and its links to existing knowledge and evidence,” and is supposed to “guide the development of the study’s survey instruments and operations and identify the methods used” (Ainley & Carstens, 2018, p. 10). A review of the study publications shows that the theoretical framing of the study was progressively developed over the last three cycles. While in 2008 the study framework contained a short description mostly dedicated to policy issues (OECD, 2010; Rutkowski et al., 2013), the theoretical framework in 2013 was substantially extended by an international research group contributing to the various aspects of the study (OECD, 2010; Rutkowski et al., 2013). Compared to its forerunners, the 2018 study framework (Ainley & Carstens, 2018) provides a more elaborated and systematic view of the school system and its factors. While in 2008 and 2013 the assessed system characteristics are organized solely around
policy issues (which are scientifically more elaborated in 2013 than in 2008), the third cycle’s framework adds knowledge from research findings on effective teaching and learning conditions. Effectiveness is defined in very broad terms as “the extent to which a given activity’s stated objectives are met,” while effective teaching and learning environments are considered as “elements that contribute to student cognitive and affective learning” (Ainley & Carstens, 2018, p. 28). The selection of indicators is based on prior research on aspects of the teaching and learning environment that contribute to positive student learning. The framework acknowledges that many factors of effective teaching and learning are not included in TALIS 2018 and that their collection would require other methods than the self-reports used in the study. The current TALIS 2018 conceptual framework addresses themes and priorities related to professional characteristics and pedagogical practices at the institutional level (school leadership and climate, human resources and stakeholder relations) and the individual level (pedagogical practices, teacher education, feedback and development, self-efficacy, and job satisfaction and motivation). Innovation, equity, and diversity are mentioned as being of particular “policy and research” interest. The TALIS 2018 conceptual framework consists of three main components: themes, indicators, and an analytic schema. It provides a description of each theme accompanied by a short literature review and proposes indicators. The first part primarily focuses on the description of indicators for system monitoring, policy considerations, and the prioritization of themes for the assessment, the latter being a result of the OECD’s priority rating exercise conducted together with member countries, partner countries and economies that had expressed interest in taking part in the survey, as well as the European Commission. The links of the themes to “policy issues” are presented, as well as the link of TALIS 2018 to other OECD studies. In the second part, the themes are mapped onto a two-dimensional matrix, with the level of analysis as the first dimension (teacher and institutional level) and the focus as the second dimension (with the categories professional characteristics and pedagogical practices). The framework refers to current developments in educational effectiveness research concerning the adoption of dynamic models of school effects (Creemers & Kyriakides, 2015) and acknowledges the reciprocal nature of relationships between factors at different levels in the school system.

Early Childhood Education and Care (ECEC) Preprimary Project (1987–1989, 1992, 1995–97) and the IEA ECES: Early Childhood Education Study (2015)

The Preprimary Project (1987–1989, 1992, 1995–97) was designed as a longitudinal study to explore the quality of life of 4-year-old children in care and educational environments, and to assess how these environments affected their development (Katz, 1990; Leimu et al., 1992). The project consisted of three phases. During the first phase (1987–1989), data on early childhood education and care (ECEC) provision was collected at the national level as well as via a household survey. During the
second phase (1992), observational data was collected and interviews were conducted to assess children’s development and their family background. Finally, in the third phase (1995–1997), data on children’s developmental stage at the age of 7 was collected. The major purpose was threefold: (1) to produce profiles of national policies on the care and education of young children and to identify and characterize the major early childhood care and educational settings; (2) to explore the impact of programmatic and familial factors on the development of children; and (3) to examine the relationship between early childhood experiences at age 4 and children’s cognitive and language development at age 7, all of which were relevant to primary school performance and success. This experience was used in the scope of the Early Childhood Education Study (ECES, 2015). The ECES (Bertram & Pascal, 2016) target population for the assessment module consisted of children attending center-based education and care in the final year of ISCED 0. The purpose of ECES is to explore, describe, and analyze the provision of ECEC and its role in preparing children for the learning and social demands of school and wider society. The study was designed to provide meaningful information for countries, states, and jurisdictions in relation to how ECEC contributes to children’s outcomes. The study was divided into two phases, with the first phase focusing on the national policy context for ECEC. In this phase, data was collected about the policy aims and goals, delivery models and providers, access and participation, quality, and expectations for outcomes. During the second phase, information on children’s competencies at the end of early childhood education was to be collected and complemented by contextual data on settings, leaders/managers, practitioners, and parents. While the first phase was completed (Bertram & Pascal, 2016), the second phase was put on hold due to an insufficient number of participating countries. The theoretical conceptualization draws on models of student learning developed within effective classroom research, assuming that the structure of early childhood education settings and that of the school system substantially resemble each other. The theoretical framework refers particularly to Huitt’s revised transactional model of the teaching/learning process (Huitt, 1997a, 2003; Huitt et al., 2009) described earlier in this chapter, which was adapted for the purpose of the study (Stancel-Piątak & Hencke, 2014). The choice of this model was justified by its major advantages: first, it emphasizes the pedagogical interaction between the teacher and the student/child; second, it considers the learning context of the school or ECEC setting and the family background of the children, as well as the wider regional and institutional context; and third, it includes different measures of learning beyond basic skills (e.g., self-esteem, motivation), permitting an extension to the ECEC context in line with the holistic view on child development.

OECD Starting Strong Survey (2018)

The OECD’s Starting Strong Survey was conducted in its first cycle in 2018 and is the first international large-scale study in the ECEC context planned to be implemented cyclically. The study’s aim is to provide “international indicators
and policy-relevant analysis on ECEC staff and centre leaders, their pedagogical and professional practices, and the learning and well-being environments in ECEC centres” (Sim et al., 2019, p. 12). The study design and framework were developed based on the experiences gathered from previous assessments in this area. The framework explicitly focuses not only on learning aspects but also on child well-being, a perspective grounded in the holistic approach to child development (LMTF, 2013). The selected environment and workforce indicators are assumed to be related to children’s positive development and learning outcomes, based on an extensive literature review implemented by an expert group from the participating countries (Sim et al., 2019). The survey aims explicitly at the greatest possible alignment with TALIS 2018 while acknowledging differences between the school and the ECEC systems. The survey’s framework is similar to that of TALIS 2018. Indicators for system monitoring are presented in the first part of the conceptual framework, along with policy considerations and the prioritization of themes. Based on the experience of IEA’s ECES, the Starting Strong Survey framework used the adapted version of Huitt’s revised transactional model of the teaching/learning process (Huitt, 1997a, 2003; Huitt et al., 2009) as a starting point to develop a conceptual model for the purpose of the survey, suitable for the analysis of child well-being and learning in early childhood education settings (Sim et al., 2019).

Conclusions and Recommendations

Concepts of student learning have developed from the different traditions and research strands discussed in the first part of this chapter. Research in the areas of effective learning, effective classroom research, teacher effectiveness, and educational and school effectiveness has led to attempts at a more comprehensive view of student learning at school. Despite these developments, research in this area is still very diverse with respect to methods, theories, and approaches. This is not only due to different research traditions and a wide range of perspectives, but also because the research questions and approaches are as culturally dependent as the school systems themselves. Additionally, school systems undergo continuous developments and changes – currently predominantly due to the globalization of education and related policy discourses in which ILSAs play an important role. The continuous expansion of research- and policy-driven interest in ILSAs has resulted, on the one hand, in an increasing demand for reliable measures, corresponding with a demand for high-quality documentation of the results. ILSAs have therefore especially improved the international comparability of data and the usefulness of their results for national applications. On the other hand, increasingly comprehensive and generic theoretical models have caused ILSAs to implement complex survey methods (e.g., rotated designs) to cope with the amount and intricacy of information required to provide a comprehensive and reliable picture. Theoretical improvements and empirical studies have contributed to providing more precise
and nuanced descriptions of factors at all levels of the education system considered relevant for student school learning. Concerning data analysis, complex methods are increasingly necessary to address the multitude of factors and their interactions, such as those reflected in latent multilevel models with cross-level interaction effects. In light of these developments, the implementation of increasingly complex methodology under the conditions of increasingly rigorous frameworks has become a constant challenge for each upcoming ILSA cycle.

Regarding ILSA frameworks, the theoretical developments are to a great extent mirrored in the study frameworks, which consequently tend to increase in complexity over time. The extent to which new hypotheses are based on prior empirical findings, and to which study frameworks are updated and refined, differs between studies. It is obvious that cyclical applications should consider both the need for development on the one hand, and comparability across cycles and sustainability to allow for trend analyses on the other. While some cyclical studies experienced extensive development in their theoretical conceptualization over the years, others put a greater emphasis on trend analysis, relying on their prior conceptualization.

The very early ILSAs (IEA’s Twelve-Country Pilot Study and FIMS) strongly emphasized the development of procedures and instruments that would allow for cross-country comparability. Subsequent early studies drew on the experiences gathered in these first applications, extending efforts to provide a sound theoretical framework. This can be noted consistently throughout the early IEA studies. Since 1995, the aims of implemented ILSAs seem to have concentrated more on the goal of ensuring comparability for trend analysis, implementing only minor changes and additions. This is true for TIMSS as well as for the PIRLS assessments. The development of the OECD studies on education (PISA and TALIS) differed slightly in this respect. While trend aspects have been maintained from the beginning in PISA, the theoretical conceptualization of the study has been developed gradually over the course of its cycles, with the most comprehensive and extensive reviews in the recent PISA 2018 cycle. A similar, but less pronounced, pattern can be observed in TALIS. As a very young study, the Starting Strong Survey includes a state-of-the-art theoretical conceptualization and an extensive literature review, as is the case with other more recent OECD ILSAs.

Despite these differences, it can be stated that ILSA conceptual frameworks also have some commonalities with respect to their theoretical framing, of which their foundation in the input-process-output paradigm is probably one of the most obvious. At the same time, in most studies, the reciprocal relationship between factors at different levels of the system is acknowledged, as emphasized by state-of-the-art theories of educational effectiveness. While in the older IEA studies the relevant factors for student learning were collected from different theoretical perspectives by a group of experts, TIMSS 1995 used a theory-based approach to systematize school learning and teaching factors and to provide a conceptual and analytical framework. From a very early stage on, IEA studies have also included noncognitive outcomes and, in this regard, to some extent surpassed the theoretical developments present at that time. This pertains especially to a comprehensive perspective and the inclusion
of national contexts into the theoretical conceptualization of student learning at school, a perspective implemented in learning and teaching theories on a wider scale only later, in the 1980s.

While many of the challenges have been addressed to a varying degree by different studies (e.g., cross-cultural comparability), there is one major limitation which was pointed out already in the Six-Subject Study (1970–71) and which has not been properly addressed to date. This pertains to the discussion on causality in cross-sectional studies. Although methodological approaches have been proposed in order to enable a closer approximation of causal effects with cross-sectional studies (e.g., matching methods or difference-in-differences designs), this remains challenging due to demanding implementation requirements, limited interpretability, and an as yet insufficient body of simulation studies. Nevertheless, due to the dynamic nature of education systems (reflected in reverse causality, also called reciprocal determinism, simultaneous effects, or recursive effects; Scheerens & Bosker, 1997), and the fact that policy action most often causes long-term – rather than short-term – effects, the need for longitudinal data for within-country analysis cannot be neglected. Future challenges will pertain to the dynamic nature of processes and the apparent inability to capture them sufficiently well with cross-sectional data.

The review of study frameworks presented here – their coverage and inclusion of theoretical models – revealed inclusiveness and comprehensiveness of approach as another key challenge. With the increasing complexity and comprehensiveness of the theoretical models, there is a need for a systematic overarching approach to developing research agendas in ILSAs. In addition to offering the possibility of identifying research gaps and providing a more comprehensive picture of education systems, such an approach would also contribute to avoiding over-testing of certain aspects or populations. Looking at recent ILSAs, a great diversity of themes, approaches, foci, and assessed populations can be discerned. A systematic identification of these elements would therefore allow for more elaborate and complementary conclusions. This would also include an identification of areas and themes which should be included in each study to enhance comparability across studies and populations. Another long-term solution could be assessing selected themes on a bicyclic basis (i.e., in every other cycle) to reduce testing time. This relates particularly (but not only) to the higher levels of the system, as their contribution to student learning can only be observed in the long term. Related to this, trend analyses might become one of the most promising types of reporting from ILSAs in the future.

Considering that the selection of ILSAs reviewed in this chapter is limited by the scope of the handbook, conclusions should not be generalized beyond these boundaries. Other chapters of this handbook will present more insight into instrument development, as well as results from ILSAs which were not considered in this chapter. Moreover, the reviewed theoretical approaches focused on models of school learning, with particular emphasis on educational effectiveness models. Links to socio-psychological (Spiel et al., 2010) and economic (Hanushek et al., 2016) models are provided only occasionally, where appropriate. Conclusions from the presented review may be limited; however, they are of importance for a wide range of aspects reflecting on the extent to which the theoretical frameworks of the major ILSAs comprehensively mirror systemic theoretical frameworks. In addition, the chapter points out major research gaps, with the exploration of the dynamic nature of these processes presumably deserving the greatest attention.
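To make the kind of analysis referred to above more concrete, the following minimal sketch shows how a two-level model with a cross-level interaction (a student-level predictor moderated by a school-level predictor) could be specified on synthetic data. Variable names are hypothetical, the latent (measurement) part is omitted for brevity, and the example is illustrative rather than a template for analyzing actual ILSA data, which would additionally require plausible values and survey weights.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n_schools, n_students = 100, 20

# Synthetic two-level data: students nested in schools.
school_ids = np.repeat(np.arange(n_schools), n_students)
school_climate = rng.normal(0, 1, n_schools)        # school-level predictor
ses = rng.normal(0, 1, n_schools * n_students)      # student-level predictor
school_intercepts = rng.normal(0, 5, n_schools)

achievement = (
    500
    + 20 * ses
    + 10 * school_climate[school_ids]
    + 5 * ses * school_climate[school_ids]           # cross-level interaction
    + school_intercepts[school_ids]
    + rng.normal(0, 30, n_schools * n_students)
)

data = pd.DataFrame({
    "achievement": achievement,
    "ses": ses,
    "climate": school_climate[school_ids],
    "school_id": school_ids,
})

# Random-intercept, random-slope model with a cross-level interaction:
# the effect of student SES is allowed to vary across schools and to
# depend on the school-level climate measure.
model = smf.mixedlm(
    "achievement ~ ses * climate",
    data,
    groups=data["school_id"],
    re_formula="~ses",
).fit()
print(model.summary())
```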

References Ainley, J., & Carstens, R. (2018). Teaching and learning international survey (TALIS) 2018. Conceptual framework. OECD. (OECD Education Working Papers, 187). Anderson, C. A. (1961). Methodology of comparative education. International Review of Education, 7(1), 1–23. Available online at http://www.jstor.org/stable/3441689 Anderson, O. R. (1989). The teaching and learning of biology in the United States. Second IEA Science Study. Columbia University. (Second IEA Science Sudy SISS). Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Prentice Hall. Bertram, T., & Pascal, C. (2016). Early childhood policies and systems in eight countries. Springer. Binkley, M., Rust, K., & Williams, T. (Eds.). (1996). Reading literacy in an international perspective. Collected papers from the IEA reading literacy study. National Center for Education Statistics. Bloom, B. S. (1969). Cross-national study of educational attainment: Stage I of the I.E.A. Investigation in six subject areas. Final report (Vol. I). University of Chicago. Bronfenbrenner, U. (1978). Ansätze zu einer experimentellen Ökologie menschlicher Entwicklung. In R. Oerter (Ed.), Entwicklung als lebenslanger Prozeß (pp. 33–65). Hoffman und Campe. Bronfenbrenner, U. (1981). Die Ökologie der menschlichen Entwicklung. Natürliche und geplante Experimente. Klett-Cotta. Carroll, J. B. (1963). A model of school learning. Teachers College Record, 64(8), 723–733. Carroll, J. B. (1975). The teaching of French as a foreign language in eight countries. Almqvist & Wiksell. Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. (1966). Equality of educational opportunity. U.S. Deprtment of Health, Education, and Welfare, Office of Education. Creemers, B. P. M. (1994). The effective classroom. Cassell. Creemers, B. P. M., & Kyriakidēs, L. (2008a). The dynamics of educational effectiveness. A contribution to policy, practice and theory in contemporary schools/Bert P.M. Creemers and Leonidas Kyriakides. Routledge (Contexts of learning). Creemers, B. P. M., & Kyriakidēs, L. (2008b). The dynamics of educational effectiveness. A contribution to policy, practice and theory in contemporary schools. Routledge. Creemers, B. P. M., & Kyriakides, L. (2012). Improving quality in education: Dynamic approaches to school improvement. Routledge. Creemers, B., & Kyriakides, L. (2015). Process-product research. A cornerstone in educational effectiveness research. In. Journal of Classroom Interaction, 50(1), 107–119. Creemers, B. P. M., & Reezigt, G. J. (1996). School level conditions affecting the effectiveness of instruction. School Effectiveness and School Improvement, 7(3), 197–228. Creemers, B. P. M., Sheerens, J., & Reynolds, D. (2000). Theory development in school effectiveness research. In C. Teddlie & D. Reynolds (Eds.), The international handbook of school effectiveness (pp. 283–300). Falmer Press. Elley, W. B. (1993). The IEA reading literacy study. The international report. Pergamon Press. Elley, W. B. (Ed.). (1994). The IEA study of reading literacy. Reading and instruction in thirty-two school systems. International Association for the Evaluation of Educational Achievement. Pergamon Press. (International studies in educational achievement, v. 11).

Hanushek, E. A. (2016). What matters for student achievement: Updating Coleman on the influence of families and schools. Education Next, 16(2), 18–26. Hanushek, E. A., Machin, S., & Woessmann, L. (2016). Handbook of the economics of education (Vol. 5). North Holland. (Handbooks in economics). Hattie, J. (2009). Visible learning. A synthesis of over 800 meta-analyses relating to achievement. Routledge. Huitt, W. (1997a). A transactional model of the teaching/learning process. Educational Psychology interactive. Valdosta State University. Huitt, W. (1997b). The SCANS report revisited. Valdosta State University. (Paper delivered at the Fith Annual Gulf South Business and Vocational Education Conference). Huitt, W. (2003). Educational Psychology interactive: Teaching/Learning Process Model, updated on 2/6/2014. Huitt, W., Huitt, M. A., Monetti, D. M., Hummel, J. H. (Eds.) (2009). A systems-based synthesis of research related to improving students’ academic performance. 3rd International City Break Conference sponsored by the Athens Institute for Education and Research (ATINER), October 16–19. Husén, T. (1967a). International study of achievement in mathematics: A comparison of twelve countries (2 volumes). Almqvist & Wiksell (1). Husén, T. (1967b). International study of achievement in mathematics: A comparison of twelve countries (2 volumes). Almqvist & Wiksell (2). IEA. (1988). Science achievement in seventeen countries. A preliminary report. International Association for the Evaluation of Educational Achievement (IEA). Jacobson, W. J., & Doran, R. L. (1988). Science achievement in the United States and sixteen countries. A report to the public. Columbia University. (Second IEA Science Study SISS). Jencks, C. (1973). Inequality. A reassessment of the effect of family and schooling in America. New York, London: Harper & Row (Harper colophon books, CN 334). Kast, F. E., & Rosenzweig, J. E. (1972). General system theory. Applications for organization and management. In. Academy of Management Journal, 15(4), 447–465. https://doi.org/10.2307/ 255141 Katz, L. G. (1990). Overview of the IEA preprimary project. Paper presented at the meeting of the Palais des Congres. ERIC. Katz, D., & Kahn, R. L. (1966). The social psychology of organizations. Wiley. Keeves, J. P. (1974). The IEA science project: Science achievement in three countries – Australia, the Federal Republic of Germany, and the United States. Implementation of Curricula in Science Education, 158–178. Kyriakides, L. & Demetriou, D. (2010). Investigating the impact of school policy in action upon student achievement. Extending the dynamic model of educational effectiveness. Presented at the second biennial meeting of the EARLI special interest group 18, 2010. Kyriakides, L., Creemers, B. P. M., & Charalambous, E. (2018). Searching for differential teacher and school effectiveness in terms of student socioeconomic status and gender. Implications for promoting equity. School Effectiveness and School Improvement, 30(3), 286–308. https://doi.org/10.1080/ 09243453.2018.1511603 Learning Metrics Task Force (LMTF). (2013). Toward universal learning. Montreal, Washington: Recommendations from the Learning Metrics Task Force. Leimu, K., Báthory, Z., Moahi, S., Luna, E., Watanabe, R., Hussein, M. G., et al. (Eds.). (1992). Monitoring the quality of education worldwide. A few national examples of IEA’s impact. UNESCO Publishing (84th ed.). UNESCO Publishing (Quarterly review of education). Lezotte, L. W. (1991). Correlates of effective schools. 
The first and second generation. Martin, M. O., & Kelly, D. L. (Eds.). (1996). Third international mathematics and science study (TIMSS). Technical report (Vol. 1): Design and development. Boston College. Martin, M. O., Gregory, K. D., & Stemler, S. E. (Eds.). (2000). TIMSS 1999. Technical report. International Study Center, Lynch School of Education, Boston College.

McDonell. (1995). Opportunity to learn as a research concept and a policy instrument. Educational Evaluation and Policy Analysis, 17, 305–322. Mullis, I. V. S., & Martin, M. O. (2013). TIMSS 2015 assessment frameworks. TIMSS & PIRLS International Study Center. Nachbauer, M., & Kyriakides, L. (2019). A review and evaluation of approaches to measure equity in educational outcomes. School Effectiveness and School Improvement, 1–26. https://doi.org/ 10.1080/09243453.2019.1672757 Noonan, R. D. (1976). School resources, social class, and student achievement: A comparative study of school resource allocation and the social distribution of mathematics achievement in ten countries. Doctoral thesis. Almqvist & Wiksell. OECD. (2010). Overview of TALIS 2008 and framework development. In TALIS 2008. Technical report (pp. 23–28). Organisation for Economic Co-operation and Development. OECD. (2019). PISA 2018 assessment and analytical framework. OECD Publishing. Organisation for Economic Co-operation and Development. (2019). PISA 2018 assessment and analytical framework. OECD Publishing (PISA). Organisation for Economic Co-operation and Development, publisher. (2017). PISA 2015 assessment and analytical framework. Science, reading, mathematic, financial literacy and collaborative problem solving (Revised ed.). OECD Publishing. Papanastasiou, C., Plomp, T., & Papanastasiou, E. C. (Eds.). (2011). IEA 1958–2008. 50 years of experiences and memories. Monē Kykkou (Cyprus). Cultural center. Cultural Center of the Kykkos Monastery. Available online at https://www.iea.nl/sites/default/files/2019-04/IEA_ 1958-2008.pdf PISA 2012 assessment and analytical framework. (2013). Mathematics, reading, science, problem solving and financial literacy. OECD. Porter, A. C. (1991). Creating a system of school process indicators. Educational Evaluation and Policy Analysis, 13(1), 13–29. Postlethwaite, T. N. (Ed.). (1967). School organization and student achievement: A study based on achievement in mathematics in twelve countries. Almqvist & Wiksell. Postlethwaite, T. N., & Ross, K. N. (1992). Effective schools in Reading. Implications for Educational Planners. An Exploratory Study. Postlethwaite, T. N., & Wiley, D. E. (1992). The IEA study of science II: Science achievement in twenty-three countries. Pergamon Press. Prawat, R. S. (1989a). Promoting access to knowledge, strategy, and disposition in students. A research synthesis. In. Review of Educational Research, 59(1), 1–41. https://doi.org/10.3102/ 00346543059001001 Prawat, R. S. (1989b). Teaching for understanding. Three key attributes. Teaching and Teacher Education, 5(4), 315–328. https://doi.org/10.1016/0742-051X(89)90029-2 Reynolds, D., & Teddlie, C. (2000). The process of school effectiveness. In C. Teddlie & D. Reynolds (Eds.), The international handbook of school effectiveness (pp. 134–159). Falmer Press. Reynolds, D., Sammons, P., de Fraine, B., van Damme, J., Townsend, T., Teddlie, C., & Stringfield, S. (2014). Educational effectiveness research (EER). A state-of-the-art review. School Effectiveness and School Improvement, 25(2), 197–230. https://doi.org/10.1080/09243453.2014. 885450 Rosenshine, B., & Stevens, R. (1986). Teaching functions. In M. C. Wittrock (Ed.), Handbook of research on teaching. A project of the American Educational Research Association (3rd ed., pp. 376–391). Macmillan Publishing Company. Rosier, M., & Keeves, J. P. (1991). The IEA study of science I. Science education and curricula in twenty-three countries. Pergamon Press. 
Rutkowski, D., Rutkowski, L., Bélanger, J., Knoll, S., Weatherby, K., & Prusinski, E. (2013). Teaching and learning international survey TALIS 2013. Conceptual framework. OECD.

Scheerens, J. (2017). The perspective of “limited malleability” in educational effectiveness. Treatment effects in schooling. In. Educational Research and Evaluation, 23(5–6), 247–266. https:// doi.org/10.1080/13803611.2017.1455286 Scheerens, J., & Bosker, R. J. (1997). The foundations of educational effectiveness. Elsevier Science. Scherer, R., & Nilsen, T. (2018). Closing the gaps? Differential effectiveness and accountability as a road to school improvement. School Effectiveness and School Improvement, 30(3), 255–260. https://doi.org/10.1080/09243453.2019.1623450 Schmidt, W. H., & Cogan, L. S. (1996). Development of the TIMSS context questionnaires. In M. O. Martin & D. L. Kelly (Eds.), Third International Mathematics and Science Study (TIMSS). Technical Report (Vol. 1.: Design and development, pp. 5-1–5-22). Boston Collage. Schmidt, W. H., Jorde, D., Cogan, L. S., Barrier, E., Gonzalo, I., Moser, U., et al. (1996). Characterizing pedagogical flow. An investigation of mathematics and science teaching in six countries. Kluwer Academic Publishers. Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis results. Review of Educational Research, 77, 454–499. Shavelson, R. J., McDonell, L., Oakes, J., & Carey, N. (1987). Indicator systems for monitoring mathematics and science education. The RAND Corporation. Sim, M. P. Y., Bélanger, J., Stancel-Piątak, A. &, Karoly, L. (2019). Starting strong teaching and learning international survey 2018 Conceptual Framework (197). Slavin, R. E. (1987). A theory of school and classroom organization. Educational Psychologist, 22, 89–108. Slavin, R. E. (Ed.). (1996). Education for all. Sewts & Zeitlinger. Spiel, C., Schober, B., Wagner, P., & Reimann, R. (Eds.). (2010). Bildungspsychologie. Hogrefe. Available online at http://sub-hh.ciando.com/book/?bok_id¼42083 Squires, D., Huitt, W., & Segars, J. (1983). Effective classrooms and schools: A research-based perspective. Association for Supervision and Curriculum Development. Stancel-Piątak, A. (2017). Effektivität des Schulsystems beim Abbau sozialer Ungleichheit: Latentes Mehrebenenmodell individueller und institutioneller Faktoren der sozialen Reproduktion (PIRLS). [School effectiveness on mitigating social inequalities. Latent multilevel model of the individual and institutional factors of social reproduction (PIRLS)]. Empirische Erziehungswissenschaft: Band 63. Waxmann. Stancel-Piątak, A., & Hencke, J. (2014). Overview of IEA’s early childhood education study. In IEA Early Childhood Education Study. Study framework (unpublished IEA document) (pp. 5–19). International Association for the Evaluation of Educational Achievement (IEA). Stringfield, S., & Slavin, R. E. (1992). A hierarchical longitudinal model for elementary school effects. In B. P. MCreemers & G. J. Reezigt (Eds.), Evaluation of educational effectiveness (pp. 35–69). Groningen: ICO. Tatto, M. T., Ingvarson, L., Schwille, J., Peck, R., Senk, Sh. L., & Rowley, G. (2008). Teacher education and development study in mathematics (TEDS-M): Policy, practice, and readiness to teach primary and secondary mathematics – conceptual framework. Amsterdam, The Netherlands: IEA. Teddlie, C., & Reynolds, D. (Eds.). (2000). The international handbook of school effectiveness. Falmer Press. Travers, K. J., & Westbury, I. (1989). The IEA study of mathematics I: Analysis of Mathemtics curricula. Pergamon Press. UNESCO Institute for Education. (1962). 
Educational achievements of thirteen-year-olds in twelve countries. Results from an international research project, 1959–1961. With assistance of Arthur W. Foshay, Robert L. Thorndike, Fernand Hotyat, Douglas A. Pidgeon, and David A. Walker. (International Studies in Education).

9 Assessing Cognitive Outcomes of Schooling

Frederick Koon Shing Leung and Leisi Pei

Contents
Introduction
Conceptualization of Cognitive Outcomes
Similarities and Differences of Theoretical Frameworks Across Different ILSA Studies
Organization of Theoretical Frameworks
Comparison of Content Dimensions
Comparison of Cognitive Dimensions
Development of Cognitive Dimension Across ILSAs Over Time
Mathematics Assessment Frameworks
Science Assessment Frameworks
Reading Assessment Frameworks
Implementation of the Frameworks in Item Design
Format of Assessment
Comparison of TIMSS and PISA Items’ Features
New Forms of Assessment
Discussions and Conclusion
Summary of Findings
Implications
Concluding Remarks
References

Abstract

The backbone of all International Large-Scale Assessments (ILSAs) in education is the assessment of educational outcomes over time and across a broad range of
domains as well as grade levels. Educational outcomes are often conceived as falling into content and cognitive domains. While outcomes for content domains obviously differ from subject discipline to subject discipline (e.g., mathematics versus reading), what constitutes cognitive outcomes is, however, not a straightforward matter. The framework used for cognitive outcomes may be different for different subject disciplines, and even for the same subject discipline, the same cognitive framework may not be appropriate for students of different grade levels. Even for students of the same grade level, the conceptualization of cognitive outcomes may have changed over time and hence the expectations of students also change. Moreover, different studies may have different purposes and emphases and so may define cognitive outcomes differently. Given these diverse conceptualizations of cognitive outcomes, this chapter attempts to summarize the different theoretical frameworks for cognitive outcomes in different subject disciplines for students of different grades underlying different studies over time. Then the operationalizations of the frameworks in terms of item format and item features are analyzed, using the similarities and differences between the TIMSS and PISA items as examples. The implications of the analysis for our understanding of what we want to achieve in education in terms of students’ cognitive outcomes, and of how best such outcomes should be assessed in an international context, are then discussed.

Keywords

International Large-Scale Assessments (ILSAs) · Cognitive outcomes · Item design · PISA · TIMSS

Introduction

International Large-Scale Assessments (ILSAs) in education provide valuable information for studying cross-country differences in educational outcomes and their potential causes, which lays a foundation for informing policy making in educational development (Fischman et al., 2019; Husén, 1979; Lietz & Tobin, 2016). As a fundamental question, what constitutes educational outcomes has been widely discussed in both public and professional discourses (Johansson, 2016; Rutkowski & Rutkowski, 2018). Among the mainstream ILSA studies, conducted mainly by the International Association for the Evaluation of Educational Achievement (IEA) and the Organization for Economic Co-operation and Development (OECD), educational outcomes are usually framed as a nested structure with two disparate dimensions: a content and a cognitive dimension (Mullis & Martin, 2013; OECD, 2017). As their names indicate, the content dimension refers to which subject or knowledge area is being tested (for instance, language, mathematics, or science), while the cognitive dimension refers to the psychological constructs that are being tested (for instance, comprehension, reasoning, or synthesis). These two dimensions are considered to be largely independent of each other. However, the conceptualization of the cognitive dimension is much more complicated than that of the content dimension, and there has been no standardization of this conceptualization across different ILSAs.


To shed light on the conceptualization of cognitive outcomes, this chapter provides an overview of theories underlying various conceptualizations of cognitive outcomes in different ILSAs. This is followed by a comparative analysis of similarities and differences between the cognitive domain systems in the theoretical frameworks of the mainstream ILSA studies. In particular, the content dimension and cognitive dimension commonly used in various subject disciplines at different grade levels are compared across different ILSAs and also over time. Then the implementations of the assessment frameworks in some major ILSAs are illustrated in terms of item structures and item designs. Finally, implications of the analysis for future ILSAs as well as for school education are discussed.

Conceptualization of Cognitive Outcomes

The classic conceptualization of the assessment of cognitive outcomes goes back to Bloom’s taxonomy of learning objectives in the 1950s, where cognitive outcomes involve both knowledge and the development of intellectual skills and are categorized into six levels (from the simplest to the most complex): knowledge, comprehension, application, analysis, synthesis, and evaluation (Bloom, 1956). These six levels were later modified to “remembering, understanding, applying, analyzing, evaluating and creating” in Bloom’s revised taxonomy to reflect a more active and accurate form of thinking (Anderson et al., 2001). Different from Bloom’s taxonomy, Marzano’s new taxonomy categorizes the cognitive system into four components with different levels of complexity: knowledge retrieval, comprehension, analysis, and knowledge utilization. These four components are assumed to be the necessary mental processes of knowledge manipulation during problem-solving (Marzano & Kendall, 2007). Another widely used taxonomy of intellectual outcomes is “depth of knowledge” (DOK), first proposed by Webb. DOK comprises four levels of cognitive behavior: level 1, recall; level 2, using skills and/or conceptual knowledge; level 3, strategic thinking; and level 4, extended thinking (Webb, 1997). The variety of the aforementioned theories of educational outcomes in the cognitive dimension reflects the complexity of this dimension. Such complexity is not just reflected in the conceptualization but also in its context dependency. A relevant issue in the context of ILSAs is whether a static conceptualization and operationalization should be applied across different settings (e.g., different disciplines, different age groups). This constitutes a widely discussed and debated topic in the field of international educational assessment, and the issue is addressed in the remaining sections of this chapter. The ILSAs included for study in this chapter are listed in Table 1. Among them, four studies are administered by the IEA, two by the OECD, one by the Conference of Ministers of Education of French-Speaking Countries (CONFEMEN), and one by the World Bank Group (WBG). The subjects tested in these studies are somewhat diverse, with reading and mathematics standing out as the two most frequent ones, probably due to their close association with the knowledge and skills needed in both school and daily life contexts. The diversity is also reflected in the target populations, ranging from Grade 2 students to adults up to 65 years of age, including both students and adult employees. Some ILSA studies that assess the same subject may have different foci.


Table 1 Basic information of ILSAs included in this study

Abbreviation | Study | Organization | Subject | Age
ICCS | IEA International Civic and Citizenship Education Study | IEA | Citizenship | Grade 8
ICILS | IEA International Computer and Information Literacy Study | IEA | Computer and information | Grade 8
PIRLS | Progress in International Reading Literacy Study | IEA | Reading | Grade 4
TIMSS | Trends in International Mathematics and Science Study | IEA | Mathematics; Science | Grades 4 and 8
PISA | Programme for International Student Assessment | OECD | Mathematics; Science; Reading | Age 15
PIAAC | Programme for the International Assessment of Adult Competencies | OECD | Literacy and reading; Numeracy; Problem-solving in technology-rich environments | Ages 16 to 65
PASEC | Programme for the Analysis of Education Systems | CONFEMEN | Reading; Mathematics | Grades 2 and 6
STEP | Skills Towards Employability and Productivity | WBG | Reading | Ages 15–64

For example, the IEA studies focus on the curriculum and thus develop the assessment framework largely based on the elements of primary and secondary school curricula. In contrast, the tests organized by the OECD, such as PISA, are designed to measure literacy, which is more oriented to knowledge application and skills development in real-life situations (these differences are elaborated in the discussions below). Given the diversity of ILSAs in the aspects mentioned above, it is interesting and important to see how the different dimensions of assessment (e.g., content and cognitive) are operationalized in the different studies.

Similarities and Differences of Theoretical Frameworks Across Different ILSA Studies

Organization of Theoretical Frameworks

Each of the ILSA studies has its own specifically designed assessment framework that sets out which aspects or domains of educational outcomes to assess. These frameworks usually follow or are inspired by cognitive taxonomy theories, some of which are introduced above. The most common type of framework divides the educational outcomes to be assessed into two dimensions, usually termed the cognitive and content dimensions.


Some ILSA studies regard these two dimensions as being separate, whereas others regard them as interdependent. The majority of ILSA studies adopt the separate assumption and thus specify the requirements in the content and cognitive dimensions separately, without one being dependent on the other. Examples are ICCS, TIMSS, PISA, and PIAAC. In contrast, the interdependent type of assessment framework adopted by other ILSA studies is usually organized by a leading dimension – either the content or the cognitive dimension – followed by a specific description of the second dimension depending on the leading one. For example, PASEC organizes the assessment outcomes first by the content domain and then lists the specific cognitive requirements under each content domain; in contrast, ICILS, PIRLS, and STEP take the cognitive dimension as the leading one and elaborate each of the different cognitive domains with specific knowledge requirements. Regardless of the interdependency of the dimensions discussed above, the majority of the assessment frameworks adopt the content and cognitive dimensions with the aim of separating the constructs of knowledge and skills that are being tested. Such commonality provides a convenient basis for systematic and in-depth comparison across different studies. In addition to the content and cognitive dimensions, a few other studies introduce a third dimension for characterizing the test items: the context dimension. Note that “context” here does not refer to the survey background of schools, teachers, or students but to the situations in which question items are embedded. Some studies recognize the importance of this dimension in the sense that context should be relevant and diverse. For instance, STEP requires that the literacy assessment should use a variety of materials in question items to cover a broad range of relevant settings, such as home and family, work and training, leisure and recreation, etc. The context dimension is also listed in PISA and PIAAC, in parallel with the content and cognitive dimensions, but the subcategories are different from the context domains in STEP. Different organizations of the framework may also be due to different orientations toward the aspects of educational outcome examined. There are two typical orientations: curriculum-oriented (as in TIMSS) and literacy-oriented (as in PISA). Orientation to the curriculum mainly serves as an evaluation of how well students learn from the curriculum and how much learning outcomes align with both the intended and the implemented curriculum. In other words, such ILSA studies may be considered a kind of alignment study, examining the alignment between the intended and implemented curriculum on the one hand and student achievement on the other. In line with this orientation, curriculum-oriented studies design and develop the content of assessments in close accordance with the curriculum intentions and the curriculum as implemented in the schools. In contrast, assessments aiming at literacy place more emphasis on fundamental competences in real-life contexts. As such, studies that are literacy-oriented develop the content of assessments to be more closely related to general knowledge and skills, not just those bound by the curriculum contents. Despite the different orientations to the aspect of educational outcomes examined, the two types of assessments mentioned above do have a large overlap in content design.


Many subject matters in various curricula are essential for developing the knowledge and skills needed for tackling real-life issues: reading and mathematics are the two most representative examples and are covered in most ILSAs. For these subject matters, there is a large overlap in the assessment contents across the different orientations, as the array of competences needed for learning performance in the school curriculum context will significantly contribute to achieving the requirements of literacy-oriented assessments as well, and vice versa. Indeed, reading and numeracy skills are universally listed in the learning objectives of all national curriculum standards. Taken together, there exists a high degree of similarity in the dimensions being tested across the frameworks of dimension categorization in the various ILSA studies. Among them, the categories of content dimension and cognitive dimension are the most widely accepted ones. These two can thus serve as a convenient basis for systematic and in-depth comparison of the similarities and differences in the dimension frameworks across different ILSA studies. This is elaborated in the following sections.

Comparison of Content Dimensions

It is natural to expect great diversity in content dimensions across different ILSA studies in education, as most of them were specifically designed and developed to assess different aspects of educational outcome and/or to assess different populations. For example, among the IEA studies, ICCS assesses educational outcomes in civic and citizenship education, TIMSS assesses the knowledge and skills involved in the mathematics and science curricula, and PIRLS assesses reading literacy. Therefore, it is not very meaningful to compare the differences in the contents of different ILSA studies that are meant to assess different subject matters. However, it is interesting to compare the content domains covered by different ILSA studies that focus on the same subject matter. For instance, TIMSS and PISA both cover mathematics and science achievement, and PIAAC and STEP both cover reading literacy. If there are important differences between them, there must be good reasons for the existence of the differences, and the differences may express themselves subtly in the categorization framework of the content domains. In TIMSS, for example, for the subject of mathematics, the content dimension is divided into four domains: number, algebra, geometry, and data and chance; whereas in PISA, mathematics contents are categorized into change and relationships, space and shape, quantity, and uncertainty and data. The difference between the two organizations of the content domain may be regarded as a reflection of the different intents of the two assessments alluded to above (curriculum-based for TIMSS, and overarching ideas with explicit emphasis on the context in which mathematics competencies are applied for PISA) (Neidorf et al., 2006). Even within the same ILSA study, the content domain may also differ across different target populations, e.g., different age groups.


Taking TIMSS science as an example, the content domains for Grade 4 are life science, physical science, and earth science, whereas in Grade 8 they are physics, chemistry, biology, and earth science. The content domain may also differ for the same subject discipline across different ILSA studies. Take mathematics or numeracy as an example. The content of the assessment in PIAAC, which is meant for adults, is obviously substantially different from that designed for the Grade 4 students in TIMSS, owing to the different levels of knowledge and skill acquisition of the different age groups. In sum, the content domains differ substantially across ILSA studies. A point worth remarking here is that the differences are not merely due to the different subject matters that the studies cover but are also driven by many other factors, such as the target population, the construct of educational outcome, and other theoretical concerns of the assessment organization.

Comparison of Cognitive Dimensions

The cognitive dimension assesses processes of a very different nature from the content dimension. In most contexts, these two dimensions are considered to be independent of each other. That is probably one of the reasons why most ILSA studies include both dimensions, as considering only one without the other would be incomplete for examining the full cognitive structure of the outcome. By definition, the cognitive dimension refers to the different cognitive constructs that education programs can and aim to enhance, for example, reasoning skills, comprehension ability, memory capacity, etc. Various terms have been used to label this cognitive dimension. They include “literacy, accomplishment, attainment, ability, capability, competence, competency, proficiency, skill, knowledge, and standards” (Leung, 2020). These terms may refer to similar or different constructs, and the same cognitive ability can be reflected in many different forms or contents of assessment. The cognitive dimension is usually considered to be the dimension of greater relevance to education, as the ultimate goal of education is to shape the cognitive systems of individuals so that they are prepared to tackle different issues in different contexts (Krathwohl & Anderson, 2009). As previously mentioned, numerous educational theories have attempted to establish a universal hierarchical system of cognitive processing. However, most of the mainstream ILSAs do not explicitly follow any of these existing theoretical frameworks directly and exactly, except for PISA, which specifically adopts Webb’s depth of knowledge framework. In principle, at least three levels of demand of cognitive skills or processes may be assessed along with the content dimension: (1) the knowing and understanding of key concepts, which mainly depends on memory; (2) applying and interpreting, which emphasizes the ability to effectively apply the learned knowledge; and (3) analyzing and evaluating complicated and non-routine problems, which tests the highest level of ability in complex problem-solving. For various reasons, practical or theoretical, there are a handful of other variations of the categorization of cognitive domains – some simplify the three levels discussed above, some expand them.


For example, TIMSS 1995 categorized the cognitive dimension (named performance expectations) as falling into five levels or categories: knowing; using routine procedures; investigating and problem-solving; mathematical reasoning; and communicating. ICCS merged the last two of the three levels above into one, reasoning and applying, while retaining the one for knowing. In PIRLS, a study assessing reading literacy, the domain of application is further divided into two subdomains: (a) making straightforward inferences and (b) interpreting and integrating ideas and information. These two examples are mild variations of the three-level model that remain within the same framework. There are other variations which introduce new branches of the cognitive component that are conceptually novel to the three-level model. For example, communication could be a new cognitive construct that is not covered in any domain of the three-level model. PIAAC introduces a fourth domain, named communicate, in its numeracy assessment to represent the ability to explain mathematical concepts to peers. Similar to the content dimension, the cognitive dimension also varies across studies depending on the specific aims or scopes of the ILSA studies. With respect to the two different orientations mentioned above, curriculum-oriented and literacy-oriented studies, the structures of the cognitive domains are systematically different from each other. The literacy-oriented assessments tend to examine elementary and fundamental cognitive abilities, as these abilities are supposed to be universally needed in various life contexts. For instance, PIAAC highlights communication skills in its numeracy framework, as communicating plays an important role in describing and representing mathematical information and interpretations to someone else, which is very common in daily life (OECD, 2012). In contrast, curriculum-oriented assessments focus on abilities that are more advanced and may not necessarily be widely needed in daily life contexts. These abilities are usually tightly connected to the specificities of the curricula of interest. For example, in the assessment framework of ICILS, the cognitive dimension largely focuses on knowing and applying. These two domains are very strongly dependent on the education afforded by the curriculum concerned. The third domain, creating, is the one that tends to examine universal cognitive abilities (Fraillon et al., 2019).

Development of Cognitive Dimension Across ILSAs Over Time

Even for the same content and cognitive domains, has the conceptualization of cognitive outcomes changed over time for different subject disciplines? It has been more than half a century since the implementation of the first international large-scale assessment study by the IEA. Due to the change of theoretical basis from behaviorism to cognitivism in the late 1950s, the assessment frameworks of ILSAs have undergone several major reforms with the goal of improving their effectiveness in measuring cognitive dimensions. In this section, using IEA and PISA studies as examples, we summarize the changes in the conceptualization of cognitive outcomes over the years in three representative subject disciplines: mathematics, reading, and science.


Mathematics Assessment Frameworks

IEA Studies
TIMSS formally started in 1991 and was first administered in 1995. Its establishment was largely based on the early studies by the IEA: the First International Mathematics Study (FIMS) in 1964 and the Second International Mathematics Study (SIMS) in 1981–1982. The taxonomies for the cognitive dimension used in TIMSS and its predecessors have evolved over the years and changed several times (Fig. 1).

Fig. 1 The development of the cognitive dimension in the TIMSS mathematics framework from 1964 to 2019


Influenced by Bloom’s taxonomy of educational objectives (Bloom, 1956), FIMS categorized the cognitive dimension of the assessment into five “intellectual processes” (Fig. 1). Compared to FIMS, SIMS was more curriculum-oriented, and it brought the cognitive dimension to a more fundamental level and called it the “cognitive behavior dimension,” which can still be partially mapped to Bloom’s categories. Built upon the first two studies of mathematics achievement, the first cycle of TIMSS conducted in 1995 organized the cognitive dimension into five domains according to “performance expectations.” The new organization was driven by the argument that the hierarchical organization of cognitive behaviors used in SIMS could not reflect the complexity of the internal connections among different cognitive processes and that there were large overlaps among the different cognitive processes used in SIMS. To address this drawback, TIMSS in 1995 replaced the “cognitive behavior dimension” with “performance expectations,” which was deemed to better characterize performances as the manifestation of different cognitive processes when students are engaged in the assessment tasks. In 2003, TIMSS started to term the various aspects of assessment “cognitive domains,” independent of the content domains. This two-dimensional framework has been maintained up to the latest cycle of assessment in 2019. By 2003, TIMSS had moved from being the 1995 Third International Mathematics and Science Study and its follow-up or repeat study in 1999 (known as TIMSS-R) to being the Trends in International Mathematics and Science Study, and it established the practice of including both Grade 4 and Grade 8 students as target populations. TIMSS 2003 included four cognitive domains for both populations: “knowing facts and procedures,” “using concepts,” “solving routine problems,” and “reasoning.” Compared with the cognitive dimension in 1995, the version in 2003 retains most of the categories except for problem-solving, as the TIMSS committee pointed out that problem-solving is a general skill which intermingles with the content domains and thus should not be listed as a separate cognitive domain (Mullis et al., 2003). However, the categories used in the 2003 framework still possess a certain conceptual vagueness. For example, the meanings of “knowing facts and procedures” and “using concepts” are not clearly differentiable, which hampered the process of designing items based on the cognitive domains. To make the cognitive dimension as conceptually simple as possible for easy implementation (e.g., in terms of item design), TIMSS 2007 further consolidated the previous cognitive framework and developed a much more simplified and distinguishable one: knowing, applying, and reasoning (Mullis et al., 2007). This framework covers most of the universal cognitive processes that are measured in various assessment tasks. Thus, after more than 50 years of development by the IEA, the cognitive dimension of the TIMSS framework has undergone several reforms and has become a system that is substantially different from Bloom’s taxonomy of learning outcomes, which, despite the drawback of conceptual overlap arising from its hierarchical organization, has dominated the educational assessment field for a long time.

PISA
PISA has undergone three major stages of restructuring with regard to the cognitive dimension in its mathematics assessment framework (Fig. 2).


Fig. 2 The development of the cognitive dimension in the PISA mathematics framework from 2000 to 2018

Highlighting mathematical literacy applied in daily life, PISA was designed to encompass a set of general mathematical processes, organized into three competency classes in the first cycle of PISA in 2000. Similar to Bloom’s hierarchical taxonomy, the processes defined in PISA are in ascending order of difficulty but do not necessarily need to be acquired in the same order by all students (Schleicher & Tamassia, 2000, p. 52), which allows for a certain flexibility in item design and in the implementation of the assessment. Extending the competency categories used in 2000, PISA 2003 further consolidated and simplified the cognitive dimension into three competency clusters based on the cognitive abilities needed to solve different mathematical problems. However, the cognitive dimension framework employed in 2003 was abandoned in 2012 and replaced with a new categorization based on the basic processes of mathematical problem-solving. The new categorization contains three sequential steps: (1) formulating situations mathematically; (2) employing mathematical concepts, facts, procedures, and reasoning; and (3) interpreting, applying and evaluating mathematical outcomes. The most significant difference between this new version of the cognitive framework and the previous one lies in the emphasis on depicting the natural process of mathematical problem-solving rather than the basic cognitive abilities. The new framework has the advantage that the required mathematical competencies are revealed to different degrees during each step of the whole process, thus truly integrating “mathematization” with the assessment of the corresponding mathematical competencies.

Science Assessment Frameworks

IEA Studies
The development of the TIMSS science assessment framework dates back to the First International Science Study (FISS) administered by the IEA in 1970–1971. The framework for science used in TIMSS has been revised and rewritten many times due to changing views on the conceptualization of science learning: from a logical-empiricist view of science to a “Kuhnian” view of science and then to a sociocultural view of science (Kind, 2013).


Under the influence of these key theoretical developments in science education, the science assessment framework underwent a major shift in expectations and foci from mental concepts and processes to scientific argumentation practices, i.e., from science processes to science practices (Lehrer & Schauble, 2007). The science assessment framework of FISS introduced a two-dimensional matrix based on Bloom’s taxonomy of learning outcomes, which organized learning outcomes as a combination of content and behavior (what students should do in the tasks). At this stage, the content and behavior dimensions were inseparable, as it was believed that students cannot understand content knowledge without engaging cognitive behaviors and vice versa. The “behavioral dimension” used in the framework of FISS was relabeled the “objectives dimension” in the Second International Science Study (SISS) in the 1980s in order to tackle the inseparability of content and cognitive demands. In 1995, the first cycle of TIMSS dealt with the entanglement of the different dimensions by splitting the original behavior dimension into performance expectations, which contained Bloom’s cognitive domains and scientific inquiry processes, and perspectives, which included attitudes and orientations. The TIMSS committee provided a solution to this problem in the framework of TIMSS 2003, but at a certain cost (Mullis et al., 2003). Two domains in the earlier version, attitudes and orientations, were removed entirely, and the process of scientific inquiry was moved out of this dimension into a separate one. These two moves reestablished a two-dimensional matrix, making the two dimensions more dissociable while aligning with and simplifying the three categories in Bloom’s taxonomy. As shown in Fig. 3, since 2007 the categories of the cognitive dimension have been relabeled to match the revised version of Bloom’s taxonomy (Krathwohl & Anderson, 2009).

PISA
The first PISA adopted a three-dimensional framework, including “scientific processes,” “scientific concepts,” and “situations and areas of application,” where scientific processes were the main focus of the assessment. Due to the difficulty of dissociating scientific processes from knowledge, PISA adopted a scientific literacy-focused approach, placing less stress on knowledge of traditional science experiments in a laboratory context and more on the processes of evaluating scientific evidence and claims in socioscientific contexts. Following this blueprint, the first PISA science framework developed five categories of scientific process. The process-oriented focus has been retained in the subsequent cycles of the PISA science framework but was reorganized into three categories in 2003, similar to the three main “phases of the scientific discovery process” (Klahr & Li, 2005). A major change in the development of the PISA science framework took place in 2006, when science became the main focus of the study. The “scientific processes” used in the first two cycles of PISA were substituted by “scientific competencies,” as shown in Fig. 4. By modeling the competencies needed in the tasks, the new framework of PISA moved away from explaining scientific principles toward performing basic tasks that students should be able to deal with in everyday life contexts. This restructuring better matches PISA’s literacy orientation, and this new framework has remained unchanged since 2006 and has been well accepted by an increasing number of science educators.


Fig. 3 The development of the cognitive dimension in the TIMSS science framework from 1970 to 2019

Reading Assessment Frameworks

IEA PIRLS
Ten years before PIRLS initiated its first assessment in 2001, the IEA administered the International Reading Literacy Study (IRLS) to assess the reading literacy of fourth graders across more than 30 countries.


Fig. 4 The development of the cognitive dimension in the PISA science framework from 2000 to 2018

However, the IEA discontinued this study and developed a new assessment of reading literacy, the Progress in International Reading Literacy Study (PIRLS), based on the latest measurement approaches. In IRLS in 1991, the cognitive dimension of the framework was named “reading processes,” which included six categories: verbatim, paraphrase, main themes, inference, locate information, and following directions. In contrast, PIRLS in 2001 reorganized the processes of comprehension required for developing reading literacy into four categories: focus on and retrieve explicitly stated information, make straightforward inferences, interpret and integrate ideas and information, and evaluate and critique content and textual elements. A comparison of the items in these two assessments shows significant differences in cognitive demands: PIRLS requires “deeper thinking” in assessment tasks than IRLS (Kapinus, 2003). For example, most of the IRLS items categorized under the “verbatim” process do not find an appropriate position in the framework of PIRLS. In addition, although most of the PIRLS items written for the “examine and evaluate content, language, and textual elements” domain can be fitted into IRLS’s “inference” category, these two domains are not perfectly interchangeable (as shown in Fig. 5).

PISA
PISA defines reading literacy in a way similar to PIRLS: both delineate reading as “an active process involving understanding and using written texts” (Shiel & Eivers, 2009). As illustrated in Fig. 6, the principles for organizing the cognitive aspects in PISA’s reading assessment framework before 2018 were mainly based on the multiple linguistic-cognitive processes involved in reading, models of discourse comprehension, and theories of performance in solving information problems (OECD, 2019). The notable change in the PISA 2018 reading assessment framework was to adapt to changes in the nature of reading literacy in a technology-enhanced environment. The PISA 2018 reading framework replaced “cognitive aspects” with “cognitive process” to align with recent developments and terminology used in reading psychology research.


Fig. 5 The development of the cognitive dimension in the IRLS and PIRLS framework from 1991 to 2021

Fig. 6 The development of the cognitive dimension in the PISA reading framework from 2000 to 2018

Aside from a set of specific cognitive processes required in reading, readers also need to use other competencies, such as goal setting and goal achievement, to engage themselves in or disengage themselves from a particular text, or to re-engage with and integrate information across multiple sources of text.


To acknowledge these goal-driven skills entailed in reading, PISA 2018 defined two broad categories of reading process: one is text processing, which includes the traditional cognitive skills required in text comprehension, and the other is task management, which monitors the process toward the goals in reading tasks.

Implementation of the Frameworks in Item Design

In the previous sections, we compared the assessment frameworks of the above-listed ILSAs from the perspective of assessment structure and theoretical basis. How do the conceptual structures and frameworks in different studies manifest themselves in the implementation of the assessment? For example, how are assessment items designed to accurately assess the specific domains to be tested? This is an important and practical angle from which to look at the assessment frameworks, one that is very different from the discussions of the theoretical or conceptual frameworks above. To examine the frameworks from this angle, we briefly elaborate on some similarities and differences across the ILSAs in item design.

Format of Assessment

In ILSA studies, multiple-choice and constructed-response questions are the two most common question formats. These two types of assessment have very different properties in terms of their ability to reflect the cognitive component of interest. It is widely believed that for testing high-level cognitive abilities, constructed-response questions may be more suitable, as they allow a higher degree of freedom in assessing the construction of thoughts (Mullis & Martin, 2013, p. 93). In contrast, multiple-choice items are more suitable for testing low-level skills or knowledge-based abilities. Different ILSA studies have different compositions of items and question types depending on their different aims. Due to their relative objectivity and ease of administration, multiple-choice questions make up the majority of the assessment items in almost all ILSA studies, since they also allow for greater coverage of the domains being tested. In order to assess students’ academic performance more comprehensively, it is in general desirable to have more diversity in the question types. For example, PISA further divides constructed-response questions into open constructed-response and closed constructed-response ones. Wu (2010) compared the item features of TIMSS and PISA and found that both TIMSS and PISA use multiple-choice and constructed-response item formats, but PISA has far more items in constructed-response format than TIMSS. Approximately two-thirds of the items in PISA are in constructed-response format, while two-thirds of the items in TIMSS are in multiple-choice format (Wu, 2010). These findings are consistent with the analysis in the previous sections of this chapter. Another format of assessment, implemented in TIMSS 1995, is known as performance assessment.


It mainly addresses the last three levels of the cognitive dimension, or performance expectations (see the discussion in section “Comparison of Cognitive Dimensions” above), in TIMSS 1995: investigating and problem-solving; mathematical reasoning; and communicating. “Performance assessment refers to the use of integrated, practical tasks, involving instruments and equipment, as a means of assessing students’ content and procedural knowledge, as well as their ability to use that knowledge in reasoning and problem solving” (Harmon et al., 1997, p. 5). Such a test format is used to assess competencies that cannot easily be demonstrated through a paper-and-pencil test. It is argued that such a format of assessment “permits a richer and deeper understanding of some aspects of student knowledge and understanding than is possible with written tests alone” (Harmon et al., 1997, p. 5). Performance assessment has its own limitations. Standardized test conditions in different countries are difficult to achieve, posing threats to the comparability of the results. Also, using instruments and equipment to perform practical tasks, if conducted rigorously, is labor intensive and hence very expensive. As a result, performance assessment in TIMSS 1995 only achieved limited success, and this format of assessment was stopped after the 1995 round of TIMSS. This shows that truly genuine testing of non-paper-and-pencil competencies is difficult to achieve. An example of a performance assessment task is shown in Fig. 7.

Comparison of TIMSS and PISA Items’ Features

Although both TIMSS and PISA label the content and cognitive domains each item belongs to, the difference between the goals of TIMSS and PISA leads to considerable differences in the design of assessment items. Since the assessment in TIMSS is based on a comprehensive analysis of the mathematics and science curricula, TIMSS items tend to be shorter and more focused on facts and processes and can be characterized as closer to “pure” mathematics and scientific knowledge within the formal school learning context. In contrast, PISA is designed to be literacy-oriented, targeting the functional competencies used in real-world contexts (Gronmo & Olsen, 2006). As a result, items in PISA are by and large written with relatively longer text and instructions. This difference in the amount of reading required in mathematics tests between TIMSS and PISA is also noted by Wu (2010), who found that items in PISA require a much greater amount of reading than TIMSS items in describing the real-world problem context and in linking the context to the pure mathematics problem. The average number of words in a PISA item stem is almost twice the average number of words in a TIMSS item stem. Related to the amount of reading required are the structures of the TIMSS and PISA items, which were also found to be dissimilar. In terms of unit structure, nearly 85% of the items in TIMSS are stand-alone items; in contrast, only 41% of the PISA items are designed as stand-alone ones, given the trade-off between the heavy amount of text in the problem setting and the number of items that can be included in the test. Wu (2010) also found that, from the perspective of the cognitive dimension, most TIMSS items can be classified as falling into the domain of solving routine problems, while most items in PISA belong to the reasoning domain (Wu, 2010).

Fig. 7 “Shadow” – an example of a performance assessment task from TIMSS 1995 (Harmon et al., 1997)



Figure 8 shows a typical item from TIMSS with very concise instructions and a simplified context. Items in PISA (an example is shown in Fig. 9), in contrast, are usually organized in units, with items within the same unit sharing the same real-world context.

Fig. 8 A sample TIMSS item (content domain, geometry; cognitive domain, reasoning) for Grade 8 (IEA, 2009)

Fig. 9 A sample PISA item unit of Farm (OECD, 2006)


New Forms of Assessment

To enable the assessment of some special or newly emerged subject matters, new forms of assessment may need to be developed, as traditional forms (multiple-choice, constructed-response) may not be sufficient to test the construct of interest. For example, the aim of ICILS is to evaluate the computer and information literacy as well as the computational thinking of students in Grade 8. Given the specificity of the knowledge and skills concerned, the assessment is highly computerized, with a very high degree of flexibility supported by information technology. These new forms of assessment may have the potential to assess students in more realistic situations that are closer to real life, since ICILS assesses knowledge and skills such as Internet searching and designing multimedia documents. Moreover, as many of the tasks to be tested in ICILS are highly complex, such as developing multimedia products, computer programming, advanced software operations, etc., new forms of assessment need to be developed and applied. More generally, the development of information technology has overwhelmingly changed the way people work and learn. For example, the nature of literacy is undergoing a deep change as new sets of skills based on information technology become necessary competences. In order to meet this new challenge in future learning, more and more ILSAs have begun the transition to conducting their assessments in a digital manner. For instance, TIMSS initiated its first digital attempt, eTIMSS, in 2019, which included additional innovative problem-solving and inquiry tasks (known as PSIs; a sample item is shown in Fig. 10) based on the advantages of the digital assessment platform (Fishbein et al., 2018; Mullis & Martin, 2017). This new platform allows for the simulation of real-world or laboratory situations in an attractive, interactive, and responsive way. Owing to the digitalization of operations, the platform also provides digital tracking of students’ problem-solving or inquiry paths, which allows rich data for performance evaluation to be collected. However, the design of the innovative tasks and the data analysis of learning paths are very demanding due to the immaturity of learning design and analytics in technology-enhanced environments. Another representative example of new forms of assessment is in the area of reading literacy. As mentioned above, PISA and PIRLS have adapted their frameworks because the nature of reading literacy has changed with the increasing involvement of reading and writing on digital devices. These two assessments have not only launched digital assessment platforms but also keep updating the assessment foci of their frameworks to keep pace with the rapid development of technology (Mullis & Martin, 2015; OECD, 2019).

Discussions and Conclusion

Summary of Findings

In this chapter, the conceptualization and theoretical frameworks of cognitive outcomes for various subject disciplines in different major ILSAs have been analyzed and compared.


Fig. 10 “Lily’s Garden” – An example mathematics PSI task for fourth graders in the eTIMSS Player (Cotter, 2019)

Different categorizations of the outcomes, especially in terms of content and cognitive dimensions, and their development over time have been presented. How the different assessment structures and theoretical bases have shaped the implementation of the assessments in different ILSAs is illustrated in terms of the structure of the assessments as well as the formats and features of the test items. New forms of assessment, brought about by shifting expectations of students as the ways people work and learn are transformed by new developments in information technology, are also touched upon. Clearly, these have implications both for future ILSAs and for education policies in individual countries.

Implications

The Design of Future ILSAs
In the past decades, ILSAs such as TIMSS and PISA have attracted much attention in the education community and beyond.


While results of ILSAs have often been abused (e.g., viewing ILSAs as competitions and focusing unduly on the ranking of countries by students’ achievement scores), proper use of ILSA results will provide benchmarks for participating countries against which they may measure the achievement of their students and the effectiveness of their education system (Leung, 2014). This is possible because ILSAs are studies “with endorsement from a large number of countries” (Leung, 2011, p. 391) and because of the rigorous methodologies adopted in ILSAs. In addition, results of ILSAs may be used to study the impact of different background variables on educational achievement, since it may often not be practicable or ethical to manipulate some background variables within a country (e.g., resource allocation to a school) in order to find out their impact on student achievement. Also, many variables within a country are uniform and cannot be manipulated, and to study the impact of those variables on student achievement, we have to collect data in different countries, where the variables differ. In the words of Drent et al., in ILSAs we are using the world as “a natural educational research laboratory” (Drent et al., 2013). In such an educational “experiment” or research, cognitive outcomes are an apposite choice to serve as the criterion variables (although one should not disregard noncognitive outcomes such as a positive attitude toward learning). A clarification of the conceptions of cognitive outcomes and a systematic understanding of the different frameworks, as deliberated in this chapter, are thus fundamental in collecting research evidence for studying the determinants of educational achievement. Based on the results of ILSAs on students’ achievements in cognitive outcomes, policy makers may then be able to make evidence-based decisions on curriculum design and devise measures to improve the quality of education in their countries. As claimed by the OECD, “the findings [of ILSAs] allow policy makers around the world to gauge the knowledge and skills of students in their own countries in comparison with those in other countries, set policy targets against measurable goals achieved by other education systems, and learn from policies and practices applied elsewhere” (OECD, 2019). As can be seen in the discussion in this chapter, cognitive outcomes are among the most common and important objects of comparison in ILSAs. Given their importance, a clarification of the conceptions of cognitive outcomes and their frameworks, as presented in this chapter, will surely contribute to informing the design of future ILSAs.

What Should a Country Be Achieving in Education?
Clear conceptualizations of cognitive outcomes and their frameworks do not only inform how future ILSAs should be conducted; they also help education authorities in participating countries to clarify and reflect on what they should be achieving in education in terms of the cognitive outcomes of students. What cognitive outcomes to inculcate in our students is perhaps one of the most important questions that governments all around the world need to grapple with. The education authority of a country will no doubt frame the cognitive outcomes they want their students to acquire according to the specific cultural traditions and the stage of economic development of the country.


But as countries invariably engage in economic competition with their regional and global counterparts, knowledge of and benchmarking against the educational objectives of other countries are extremely important in formulating such education policies in the home country. Participation in ILSAs does not only help a country to understand the policies and achievements of students in other participating countries; it also helps the country to reflect upon its own educational objectives and, in particular, on what educational outcomes it wants to see in students coming out of its system. In participating in ILSAs, countries are confronted with the need to arrive at a commonly agreed understanding of what they are comparing and, in particular, of what cognitive outcomes are. This forces them to articulate and justify the cognitive outcomes they want to inculcate in the students of their own countries. As alluded to earlier in this chapter, the conceptualizations and frameworks of cognitive outcomes are usually based on learning theories and/or instructional theories. Within a certain country, such learning and/or instructional theories may be implicit or taken for granted. The need to reach a common understanding with other participating countries in ILSAs on what cognitive outcomes are provides the opportunity for participating countries to reflect upon their educational objectives and the underlying assumptions and to reaffirm or modify those objectives in light of the knowledge of the status in other countries. It is the hope of the authors that the comparative analysis of the concepts of cognitive outcomes conducted in this chapter will help countries in that process.

Concluding Remarks

Cognitive outcomes may not be as simple as the term sounds. Since we want to achieve fair comparison in ILSAs, a lucid conceptual understanding and a commonly agreed framework of cognitive outcomes need to be formulated, which forces us to clarify the various conceptual meanings and frameworks of cognitive outcomes. This exercise clearly contributes to the design of future ILSAs. Also, the cognitive outcomes aspired to for students in different countries, as expressed in their curricula, are sometimes taken for granted in individual countries. Participation in ILSAs thus provides a good opportunity for educators around the world to reflect on what we want to achieve in education in terms of students’ cognitive outcomes in different subject disciplines and on how best such outcomes should be assessed in their own countries and in an international context. The clarification of the concept of cognitive outcomes in this chapter may provide the answer to an important component of the question “What is schooling for?”

References

Anderson, L. W., Krathwohl, D. R., & Bloom, B. S. (2001). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
Bloom, B. S. (Ed.). (1956). Taxonomy of educational objectives: The classification of educational goals: Handbook 1: Cognitive domain. David McKay.


Cotter, K. (2019). Evaluating the validity of the eTIMSS 2019 mathematics problem solving and inquiry tasks. Doctoral dissertation, Boston College.
Drent, M., Meelissen, M. R. M., & van der Kleij, F. M. (2013). The contribution of TIMSS to the link between school and classroom factors and student achievement. Journal of Curriculum Studies, 45(2), 198–224. https://doi.org/10.1080/00220272.2012.727872
Fischman, G. E., Topper, A. M., Silova, I., Goebel, J., & Holloway, J. L. (2019). Examining the influence of international large-scale assessments on national education policies. Journal of Education Policy, 34(4), 470–499. https://doi.org/10.1080/02680939.2018.1460493
Fishbein, B., Martin, M. O., Mullis, I. V. S., & Foy, P. (2018). The TIMSS 2019 item equivalence study: Examining mode effects for computer-based assessment and implications for measuring trends. Large-Scale Assessments in Education, 6(1), 11. https://doi.org/10.1186/s40536-018-0064-z
Fraillon, J., Ainley, J., Schulz, W., Duckworth, D., & Friedman, T. (2019). IEA international computer and information literacy study 2018 assessment framework. Springer.
Gronmo, L. S., & Olsen, R. V. (2006). TIMSS versus PISA: The case of pure and applied mathematics. Paper presented at the 2nd IEA International Research Conference, Washington, DC, November 8–11.
Harmon, M., Smith, T. A., Martin, M. O., Kelly, D. L., Beaton, A. E., Mullis, I. V., . . . Orpwood, G. (1997). Performance assessment: IEA’s third international mathematics and science study (TIMSS). International Association for the Evaluation of Educational Achievement.
Husén, T. (1979). An international research venture in retrospect: The IEA surveys. Comparative Education Review, 23(3), 371–385.
IEA. (2009). TIMSS 2007 assessment. International Association for the Evaluation of Educational Achievement (IEA), TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Johansson, S. (2016). International large-scale assessments: What uses, what consequences? Educational Research, 58(2), 139–148. https://doi.org/10.1080/00131881.2016.1165559
Kapinus, B. (2003). PIRLS-IEA reading literacy framework: Comparative analysis of the 1991 IEA reading study and the progress in international reading literacy study. Working paper series.
Kind, P. M. (2013). Conceptualizing the science curriculum: 40 years of developing assessment frameworks in three large-scale assessments. Science Education, 97(5), 671–694.
Klahr, D., & Li, J. (2005). Cognitive research and elementary science instruction: From the laboratory, to the classroom, and back. Journal of Science Education and Technology, 14(2), 217–238.
Krathwohl, D. R., & Anderson, L. W. (2009). A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives. Longman.
Lehrer, R., & Schauble, L. (2007). Scientific thinking and science literacy. In W. Damon, R. M. Lerner, K. A. Renninger, & I. E. Siegel (Eds.), Handbook of child psychology.
Leung, F. K. S. (2011). The significance of IEA studies for education in East Asia. In C. Papanastasiou, T. Plomp, & E. C. Papanastasiou (Eds.), IEA 1958–2008: 50 years of experiences and memories. Research Center of the Kykkos Monastery.
Leung, F. K. S. (2014). What can and should we learn from international studies of mathematics achievement? Mathematics Education Research Journal, 26(3), 579–605. https://doi.org/10.1007/s13394-013-0109-0
Leung, F. K. S. (2020). Core competencies of Chinese students in mathematics – what are they? In B. Xu, Y. Zhu, & X. Lu (Eds.), Beyond Shanghai and PISA: Cognitive and non-cognitive competencies of Chinese students in mathematics. Springer.
Lietz, P., & Tobin, M. (2016). The impact of large-scale assessments in education on education policy: Evidence from around the world. Research Papers in Education, 31(5), 499–501. https://doi.org/10.1080/02671522.2016.1225918
Marzano, R. J., & Kendall, J. S. (2007). The new taxonomy of educational objectives. Corwin Press.


Mullis, I. V. S., & Martin, M. O. (2013). TIMSS 2015 assessment frameworks. Boston College, TIMSS & PIRLS International Study Center. http://timssandpirls.bc.edu/timss2015/frameworks.html
Mullis, I. V. S., & Martin, M. O. (2015). PIRLS 2016 assessment framework (2nd ed.). Boston College, TIMSS & PIRLS International Study Center. http://timssandpirls.bc.edu/pirls2016/framework.html
Mullis, I. V. S., & Martin, M. O. (2017). TIMSS 2019 assessment frameworks. Boston College, TIMSS & PIRLS International Study Center. http://timssandpirls.bc.edu/timss2019/frameworks/
Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., & Gonzales, E. J. (2003). TIMSS assessment frameworks and specifications 2003 (2nd ed.). International Study Center, Boston College.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., Arora, A., & Erberber, E. (2007). TIMSS 2007 assessment frameworks. International Study Center, Boston College.
Neidorf, T. S., Binkley, M., Gattis, K., & Nohara, D. (2006). Comparing mathematics content in the National Assessment of Educational Progress (NAEP), Trends in International Mathematics and Science Study (TIMSS), and Program for International Student Assessment (PISA) 2003 assessments. Technical report, NCES 2006-029. National Center for Education Statistics.
OECD. (2006). PISA released items – Mathematics. https://www.oecd.org/pisa/38709418.pdf
OECD. (2012). Literacy, numeracy and problem solving in technology-rich environments: Framework for the OECD survey of adult skills. https://doi.org/10.1787/9789264128859-en
OECD. (2017). PISA 2015 assessment and analytical framework: Science, reading, mathematic, financial literacy and collaborative problem solving (Revised edition). https://doi.org/10.1787/9789264281820-en
OECD. (2019). PISA 2018 assessment and analytical framework. OECD Publishing.
Rutkowski, L., & Rutkowski, D. (2018). Improving the comparability and local usefulness of international assessments: A look back and a way forward. Scandinavian Journal of Educational Research, 62(3), 354–367. https://doi.org/10.1080/00313831.2016.1261044
Schleicher, A., & Tamassia, C. (2000). Measuring student knowledge and skills: The PISA 2000 assessment of reading, mathematical and scientific literacy. OECD.
Shiel, G., & Eivers, E. (2009). International comparisons of reading literacy: What can they tell us? Cambridge Journal of Education, 39(3), 345–360. https://doi.org/10.1080/03057640903103736
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and science education. Research monograph no. 6.
Wu, M. (2010). Comparing the similarities and differences of PISA 2003 and TIMSS. OECD education working papers, no. 32. OECD Publishing. https://doi.org/10.1787/5km4psnm13nx-en

10 Socioeconomic Inequality in Achievement: Conceptual Foundations and Empirical Measurement

Rolf Strietholt and Andrés Strello

Contents
Introduction
Distributive Rules
Different Goods, Different Rules
Socioeconomic Inequality: Implicit Assumptions
Measurement Issues
Indicators of Socioeconomic Status
Classification of Continuous and Categorical Measures
Standardization and Threshold Setting: International Comparability and National Specificity
Empirical Analyses
Data and Variables
Correlations Between the Different Measures of SES
Correlations Between the Different Measures of SES Inequality in Achievement
Standardization of Inequality Measures: Relative and Absolute Measures
Concluding Remarks
Cross-References
References

Abstract

The study of social inequality in student achievement is based on ideas of justice which are often not sufficiently explicated. Furthermore, there is a large set of measures used to quantify socioeconomic inequality in achievement. The first part of this chapter explains conceptual principles underlying measures of social inequality in achievement. For this purpose, we first introduce the concepts of adequacy and equality and discuss how social inequality extends them. In this respect, we emphasize the nature of education and its intrinsic, instrumental,
individual, and societal value. The second part of the chapter discusses key measurement issues researchers deal with when studying achievement gaps between students of different socioeconomic status. We summarize research on commonly used indicators of socioeconomic background and compare child and parent reports. Different sets of statistical measures for continuous, categorical, single, and multiple background variables are reviewed, and the distinction between relative and absolute inequality measures is discussed with a focus on the implications for cross-national comparisons and trend studies within countries over time.

Keywords

Socioeconomic inequality · Equity · Social inequality · Achievement inequality

Introduction

To understand if and why we should be concerned about socioeconomic inequality in student achievement, it is important to briefly review some conceptual philosophical foundations relevant to studying inequality. To understand the implicit assumptions in the study of social inequality in student achievement, we first discuss theoretical foundations of political philosophy. The analysis of inequality is based on normative assumptions about justice. In the following, we will try to make these assumptions explicit. We consider it important to develop an understanding that socioeconomic inequality is only one form of inequality. The provision and distribution of goods such as education, health care, or wealth are public concerns. The notions of excellence and equity introduce a normative dimension to the discussion around the provision and distribution of public goods. On the one hand, the discussions are about the achieved levels of literacy, health, or income as measured by, for example, the mean achievement scores in international student assessments, the average life expectancy, or the gross domestic product (GDP). On the other hand, the discourse is about inequality as measured by, for example, the achievement gap between privileged and disadvantaged children, unequal access to medical care, or the Gini coefficient for income inequality. While the term excellence implies that it is desirable to achieve high levels of a public good (e.g., education, wealth, health), the term equity suggests that it is also desirable to minimize inequalities in the distribution of goods (e.g., education, wealth, health). Next, we will discuss how "minimizing inequality" can have very different meanings.

Distributive Rules

The philosophical literature on inequality typically distinguishes between a) an object or public good that is distributed, and b) a distributive rule that is used to
assess inequality (see Brighouse et al., 2018). In this chapter we focus on the object "education" (or more specifically on "student achievement"), but to illustrate that different distributive rules may be justified in different contexts it is also useful to think about other public goods such as health care or wealth (Atkinson & Bourguignon, 2000, 2015; Van Doorslaer & Van Ourti, 2011). Distributive principles are central to the study of inequality because they define how to fairly distribute a public good among the members of a group, such as the students in an educational system or the citizens of a state. Different distributive rules are best illustrated with concrete examples. Without much argument, we will focus on three distributive rules: equality, adequacy, and social inequality. Three related popular measures of income inequality are the Gini index, the proportion of the population living in poverty, and the gender pay gap. Equality in wealth is frequently measured by the Gini index; a coefficient of 0 indicates perfect equality, where everyone has the same income, and a value of 1 indicates maximal inequality, where a single person has all income and everyone else has none. Another distributive rule is adequacy, which is closely related to poverty. A simple measure to quantify poverty is the proportion of people who do not reach a minimum income, however defined. Studying poverty implies that it is considered unfair to distribute income in such a way that some people do not receive a minimum income, while at the same time inequality above a certain threshold is not problematized. In fact, there are good reasons to argue that variation in income above the poverty threshold is not problematic or is even desirable. For example, the high incomes of surgeons may be a reward for prior investments in education or for taking responsibility for the lives of other people. Following liberal ideas of individual freedom and choice, it should be left to the individual to decide whether he or she wishes to study for a long period or take on a high level of responsibility. The key difference between equality and adequacy is that the latter concept introduces the distinction between unjust and just inequalities in a public good. Gender inequality in income is yet another form of inequality; the decisive factor here is whether the incomes of men and women differ. In this example, we look at gender differences, but the same reasoning can also be applied to differences between racial groups or levels of socioeconomic status (SES). The idea behind measuring gender differences is that an unequal distribution of income is not problematic per se, but only if there are systematic differences between men and women. Two common measures of gender inequality are the gap in the mean income of men and women, and the comparison of the share of men and women in poverty. These measures correspond to the two distributive rules of equality and adequacy. It should be noted, however, that neither measure of gender inequality problematizes overall variation in income or poverty itself. If half the men and half the women live in poverty, there is no gender inequality, yet is this fair? If we do not wish for people to live in poverty, we should know how many people are living in poverty, regardless of whether they are men or women. The distribution of public goods is a constant and contentious topic of political debate.
There are no universally valid principles that can be justified in the same way everywhere and at any time. Rather, different distributive rules can
also be applied to other goods. In the USA, there is a controversial discussion as to whether all Americans should have health insurance. Other countries may have universal health insurance for all of their citizens, but there are discussions on which services are standard for all the insured citizens, and which services are offered as individual, additional benefits. Progressive taxation is a means to increase the tax burden of higher incomes, while a universal basic income is a measure to reduce poverty. In some cases, it may be justified to eliminate any differences, but at the same time, it is important to stress that such equalization largely ignores individual freedom (Nozick, 1974).
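
The three distributive rules discussed in this section can be made concrete with a few lines of code. The following sketch is purely illustrative: the income data are simulated, the variable names are invented, and the 60%-of-median poverty line is just one common convention rather than anything prescribed by the chapter.

```python
# Illustrative only: simulated incomes, not data from any ILSA or register.
import numpy as np

def gini(x):
    """Gini coefficient computed from the sorted values."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.6, size=1_000)   # toy income distribution
gender = rng.choice(["m", "f"], size=income.size)        # toy group membership
poverty_line = 0.6 * np.median(income)                   # a common (assumed) convention

print("equality  (Gini index):        ", round(gini(income), 3))
print("adequacy  (poverty share):     ", round(float(np.mean(income < poverty_line)), 3))
print("group gap (mean income, m - f):",
      round(float(income[gender == "m"].mean() - income[gender == "f"].mean()), 1))
print("group gap (poverty share m, f):",
      round(float(np.mean(income[gender == "m"] < poverty_line)), 3),
      round(float(np.mean(income[gender == "f"] < poverty_line)), 3))
```

As the chapter stresses, the two group-gap lines can both be close to zero even when the Gini index and the poverty share signal substantial inequality or widespread poverty.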

Different Goods, Different Rules

Hardly anyone would want to tell someone else whether to go to the football match or the opera at the weekend. Some drink beer, others prefer wine. Inequalities are not necessarily an indicator of injustice but of a liberal society. At the same time, however, it would not be argued that in liberal societies all inequalities are legitimate. Poverty and access to (at least basic) health care are public concerns, whereas recreational activities are not. What about education? Education plays different roles for individual and social development (Robeyns, 2006; Strietholt, 2014). One can play the flute for pleasure, or to earn money in an orchestra. One can learn a foreign language for fun, or to earn money abroad. In other words, education can have an intrinsic or an instrumental value. This applies both at an individual and at a collective level. If several languages are taught and spoken in a country, the instrumental value is the strengthening of the national economy in a globalized world. The ability to communicate with people from other countries also has an intrinsic value at the collective level, because cultural exchange is related to mutual understanding. Based on the different roles and functions of education, different distributive rules provide a suitable conceptual framework to study educational inequality. For example, some people take great pleasure in studying classical literature. However, it hardly seems appropriate to demand that every student must read all works of Lessing, Schiller, and Goethe in school. It could be argued that liberalism provides a useful approach to justice in this context, in which young people are free to engage in literature, sports, or technology. But what about basic reading literacy? Egalitarianism would probably find more support here, since reading is a basic prerequisite for cultural, economic, or political participation in our society. It seems hardly justifiable to leave it to children (or their parents) to decide for themselves whether they want to learn to read or not. International large-scale assessments test students at different levels; for example, the Trends in International Mathematics and Science Study (TIMSS) assesses students in Grades 4 and 8 and in the last year of secondary education. The Program for the International Assessment of Adult Competencies (PIAAC) tests adults between 16 and 65 years old. While all these assessments cover mathematics, they differ dramatically in difficulty. The test in primary school assesses students' basic numeracy, for example, simple equations with whole
numbers, while the tests at the end of secondary education assess advanced mathematics, such as calculus. Perhaps we consider it reasonable to demand that all children perform at around the same level at the end of primary school (equality), but we do not demand equality at the end of secondary school. We may find it reasonable to demand that all students in secondary school have a basic knowledge of simple equations to solve real-world problems, but we do not demand that all students are proficient in calculus (adequacy). To level the playing field, it seems fair that all students acquire basic mathematical skills. If one agrees with this equality demand, the standard deviation of achievement scores in international school achievement studies provides a suitable measure of educational inequality. The extent to which students then decide to continue studying mathematics at school or university is an individual decision. Following this argumentation, the standard deviation is no longer a suitable measure, because it implicitly treats any variation in performance as problematic (including high proficiency). Accordingly, it would be better to focus on how many students do not reach certain minimum standards, such as being able to solve simple equations. Any variation beyond this threshold is not problematic.

Socioeconomic Inequality: Implicit Assumptions

As discussed above, the two concepts of equality and adequacy can be extended by a social dimension. Which inequalities or differences are perceived as problematic, and which are not? To measure the socioeconomic achievement gap, we can compute the mean difference in performance between disadvantaged and privileged children. By definition this measure quantifies differences between children from different socioeconomic backgrounds, but it ignores any other differences. If the performance gap between socioeconomic groups is small, there may still be other gaps, such as between genders, racial groups, and so on. In the same vein, we can compare the proportion of socioeconomically disadvantaged children who do not reach a certain basic literacy and numeracy performance level with the proportion of privileged children who do not reach these levels. However, again, even if the differences between social groups are small, that does not imply that all children are literate and numerate. So, is it socioeconomic gaps that are of interest, or something else? Brighouse et al. (2018, p. 57) question whether the gaps between socioeconomic groups are the main problem: "what is really at stake may be the low achievement of members of the low-performing group rather than the size of the gap between the average achievement of the two groups. Here the relevant distributive value may be adequacy (. . .)." Achievement gaps between socioeconomic groups receive a tremendous amount of attention in the literature based upon international assessments. It is beyond the scope of this chapter to discuss further whether more concern should be given to other forms of inequality such as equality and adequacy. It is, however, important to acknowledge the implicit assumptions behind different measures. The crucial issue is the need to provide arguments concerning which information is considered to be relevant: equality, adequacy, or socioeconomic inequality.

Measurement Issues

Indicators of Socioeconomic Status

Mueller and Parcel (1981, p. 14) refer to the broader concept of social stratification to define socioeconomic status as an individual's position within a society:

The term "social stratification," for example, is used to describe a social system (usually a society or community) in which individuals, families, or groups are ranked on certain hierarchies or dimensions according to their access to or control over valued commodities such as wealth, power, and status. A case's relative position (and associated score) on a particular hierarchy (or combination of hierarchies) may be referred to as its SES. (Mueller & Parcel, 1981, p. 14)

In studies of child development, the three commodities commonly used to measure SES are parental income, parental occupation, and parental education (Duncan et al., 1972; Gottfried, 1985; Hauser, 1994; Mueller & Parcel, 1981; Sirin, 2005; White, 1982). It is difficult to survey information on these indicators for several reasons. Students as well as their parents are often unable or unwilling to report income reliably (Moore et al., 2000). Such problems are not limited to the recording of income; further problems arise when measuring occupations and educational qualifications in international studies. It is difficult to establish valid and reliable classification systems that put degrees and occupations into a hierarchy. International classification systems such as ISCED (International Standard Classification of Education; UNESCO, 1997) and ISEI (International Socioeconomic Index of Occupational Status; Ganzeboom et al., 1992) have been developed to address this issue. However, comparable coding is only possible to a certain extent in intercultural surveys due to national differences in educational and economic systems (Jerrim et al., 2019). Furthermore, the coding of occupations is labor-intensive and therefore costly when the information is collected by means of an open question. In addition to income, occupation, and education, international large-scale assessments typically administer various questions on home possessions, such as a car, a lawnmower, or the number of books. Home possessions are common indicators of SES because questions about the presence of a car, paintings, or a lawnmower, or about the number of books, are easier to answer than questions about parental income, professions, and education. Student data can be used to survey home possessions, which is an advantage in international studies, since parents often do not fill out the questionnaires and the proportion of missing data in parent reports is therefore very high. However, regional and cultural differences remain an issue in international surveys. Owning a lawnmower is in many countries an indicator of having a garden, but very dry areas often do not have grass, so this indicator does not work in all countries. The number of cars is also less meaningful in urban areas than in rural ones. Rutkowski and Rutkowski (2013) provide evidence that the latent structure of items on home possessions varies across countries, so one should be careful when using the same items internationally to measure SES.

The number of books in the family home is probably the most popular home possession indicator used to measure SES. It has been used for more than 100 years in educational research and is part of the survey material of the majority of international assessments. The indicator is popular because books are theoretically closely linked to education and are correlated with high student achievement in virtually all countries (Brese & Mirazchiyski, 2013; Hanushek & Woessmann, 2011). Engzell (2019), however, argues against the use of the book variable because student and parent reports do not always match. He observed that girls often rate the number of books higher than boys do and that disadvantaged children tend to underestimate the number of books. While this criticism is important, it is appropriate to point out that all indicators are imperfect; gender differences, for example, are observed not only for the number of books, but also for student reports of parental education. The meaning of different indicators of SES also changes over time. Economic structural change means that the importance and prestige of certain occupations change, while new professions emerge or gain prominence. Similarly, the significance of educational qualifications is changing in the context of the great educational expansion observed worldwide over the past 100 years. Even within a few years, the significance of certain indicators of social status can change dramatically. For example, TIMSS data reveal that the share of 8th grade students who report more than 100 books at home decreased from 65% to 42% between 1995 and 2011 in Sweden (Beaton et al., 1996; Mullis et al., 2012). A possible explanation for such large differences is the spread of eBooks in recent years. Similar changes can be observed for computers and other digital devices. Watermann, Maaz, Bayer, and Roczen (2016) discuss whether SES is a multi- or unidimensional construct. If SES is considered a multidimensional construct, occupation is an indicator of social prestige, education of cultural resources, and income of financial liberties. If, on the other hand, SES is considered to be unidimensional, all indicators are measures of the same latent construct. International assessments like TIMSS and PISA typically compute and report SES indices which combine information from different components, such as the PISA index of economic, social, and cultural status (ESCS) and the TIMSS index of home resources for learning (HRL). Research papers based on international large-scale assessment (ILSA) data, however, use both single indicators and complex indices. A recent review of 35 international studies on SES inequality (Strietholt et al., 2019) reported that around half of the studies used single indicators and the other half complex indices. The most common single indicators were the number of books at home and parental education.

Classification of Continuous and Categorical Measures

There are many measures to quantify socioeconomic inequality in achievement, all of which combine socioeconomic background information with student achievement. Both socioeconomic status and achievement may be measured as categorical or continuous variables.

Table 1 Classification of measures of socioeconomic inequality in achievement

                                Achievement
Socioeconomic status            Categorical     Continuous
Categorical                     1               2
Continuous                      3               4

An example of a categorical indicator of socioeconomic status is the comparison of students whose parents have completed tertiary education with students whose parents have not; an example of a continuous indicator is household income. The achievement scores in studies like PISA and TIMSS are examples of continuous achievement measures. The achievement scale is also divided into so-called "proficiency levels" (PISA) or "international benchmarks" (TIMSS); Level 2 in PISA and the low benchmark in TIMSS are sometimes regarded as a baseline level of literacy. Following this approach, a common categorical achievement measure is whether students perform below a certain achievement threshold. According to Table 1, measures of socioeconomic inequality in achievement can thus be classified into four types, depending on whether achievement and socioeconomic status are measured categorically or continuously: (1) if both status and performance are measured as categorical variables, a simple contingency table can be used to describe inequality, and based on this information measures such as the relative risk or the odds ratio can be calculated. For example, if half of the disadvantaged children and a quarter of the privileged children do not reach a certain achievement level, the relative risk of disadvantaged children is twice that of privileged children. (2) If status is measured categorically and performance continuously, the achievement gap may be computed as the difference between the average achievement of privileged children and the average achievement of disadvantaged children. (3) If status is measured continuously and performance categorically, logistic regression can be used to regress the binary performance indicator on the continuous measure of social status. (4) If both status and performance are measured continuously, the covariance between the two variables, Pearson's correlation, or linear regression can be used to relate the continuous performance measure to the continuous status measure.
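
The four cells of Table 1 can be illustrated with a small simulation. The sketch below uses invented data and generic Python tooling (numpy, scipy, scikit-learn); it is not the operational ILSA analysis, which would involve plausible values, sampling weights, and replication-based variance estimation.

```python
# Toy illustration of the four measure types in Table 1 (simulated data).
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5_000
ses = rng.normal(size=n)                                  # continuous SES score
ach = 500 + 35 * ses + rng.normal(scale=90, size=n)       # continuous achievement

ses_high = ses > 0                                        # categorical SES (e.g., tertiary degree)
low_ach = ach < 400                                       # categorical achievement (below a benchmark)

# (1) categorical x categorical: relative risk of low achievement
rel_risk = low_ach[~ses_high].mean() / low_ach[ses_high].mean()

# (2) categorical SES x continuous achievement: mean achievement gap
gap = ach[ses_high].mean() - ach[~ses_high].mean()

# (3) continuous SES x categorical achievement: logistic regression slope
logit_slope = LogisticRegression().fit(ses.reshape(-1, 1), low_ach).coef_[0, 0]

# (4) continuous x continuous: Pearson correlation and OLS slope
r, _ = stats.pearsonr(ses, ach)
ols_slope, _ = np.polyfit(ses, ach, 1)

print(f"relative risk {rel_risk:.2f} | gap {gap:.1f} | "
      f"logit slope {logit_slope:.2f} | r {r:.2f} | OLS slope {ols_slope:.1f}")
```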

Standardization and Threshold Setting: International Comparability and National Specificity

There is a common distinction between absolute and relative measures of inequality (see Heisig et al., 2019). Absolute measures are unstandardized and relative measures are standardized. Unstandardized measures use the metric of the achievement scale to quantify inequality. In studies like TIMSS and PISA, the achievement scale has an international mean of 500 and a standard deviation of 100, so that an SES achievement gap of 50 points corresponds to half an international standard deviation. To be more precise, the metric was set in the year of the first
administration of the study and based on the countries that participated in that year; the same metric was used in subsequent years to facilitate trend analyses over time. However, the variation in test results typically differs by country, and standardized measures take these differences into account. If there is no variation in test scores in a country, there cannot be any achievement gaps. On the other hand, there could be a huge variation in test scores within one country. For example, in TIMSS 2015 (Grade 4) the standard deviation of the mathematics test scores was 57 points in the Netherlands and 107 points in Jordan. An achievement gap of 50 points corresponds to a relative achievement gap of about one standard deviation in the Netherlands, but only around half a standard deviation in Jordan. In addition to standardizing the performance variable, the grouping variable SES can also be standardized by country. Indicators of SES such as the number of books at home, parental education, or income are unequally distributed internationally, and such differences may be taken into account by standardizing the SES indicator. It is sometimes useful to divide a continuous scale using thresholds to ease interpretation. For example, the concept of academic resilience focuses on students who succeed against the odds; resilience is defined by low status and high performance. To define low status and high performance, either fixed or relative thresholds can be used; relative thresholds vary by country (see Ye et al., 2021). An example of fixed thresholds is provided by the so-called benchmark levels in TIMSS. In the TIMSS reports, all students who score at least 625 points on the TIMSS test are considered to perform at the advanced level (e.g., Mullis et al., 2016). When this fixed threshold is applied to all countries, about half of the students in countries such as Singapore and Hong Kong are classified at the advanced mathematical level, whereas in several other countries none or only a few percent of the student population reach this level. An alternative approach is therefore to classify high-performing students in each country separately by using relative thresholds that vary by country. For example, we can use the 75th percentile in each country to identify the 25% top-performing students in that country. In the same vein, fixed or relative thresholds can be used to define disadvantage. A drawback of relative thresholds is the limited substantive comparability of the groups across countries; in some countries even high-performing students have only an understanding of whole numbers, while in other countries high performance means that students are able to solve linear equations and also have a solid understanding of geometry.
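
A hedged sketch of the two standardization choices discussed here, absolute versus relative gaps and fixed versus relative thresholds, is given below. The two "countries", their standard deviations, and the SES effect are simulated to mimic the examples in the text; they are not actual TIMSS estimates.

```python
# Simulated data only: illustrates absolute vs. relative gaps and
# fixed vs. relative performance thresholds.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
frames = []
for country, sd in [("A", 60), ("B", 105)]:      # low- vs. high-variance systems
    n = 4_000
    ses_high = rng.random(n) < 0.4               # e.g., parent with tertiary education
    score = rng.normal(500, sd, n) + 25 * ses_high
    frames.append(pd.DataFrame({"country": country, "ses_high": ses_high, "score": score}))
df = pd.concat(frames)

for country, g in df.groupby("country"):
    abs_gap = g.loc[g.ses_high, "score"].mean() - g.loc[~g.ses_high, "score"].mean()
    rel_gap = abs_gap / g["score"].std()                      # standardized by the national SD
    fixed_top = (g["score"] >= 625).mean()                    # fixed (TIMSS-style) threshold
    rel_top = (g["score"] >= g["score"].quantile(0.75)).mean()  # relative threshold (top 25%)
    print(country, round(abs_gap, 1), round(rel_gap, 2),
          round(fixed_top, 3), round(rel_top, 3))
```

Because country B's scores vary more, a similar absolute gap translates into a smaller relative gap there, which mirrors the Netherlands/Jordan contrast described above.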

Empirical Analyses

Data and Variables

We next use TIMSS 2015 data to study different SES measures. In Grade 4, student achievement tests in mathematics and science as well as student, parent, teacher, and principal questionnaires were administered. The student and parent questionnaires cover various items on SES that will be used to compute measures of SES inequality in mathematics. We use data on 245,060 students in 46 countries;
in each country, around 5000 students from 150 to 200 schools were sampled. Data from England and the USA were not used because no parent questionnaire was administered, and we also excluded the regions Dubai, Abu Dhabi, Ontario, Quebec, and Buenos Aires. Martin, Mullis, and Hooper (2016) provide further information on the study design and technical details. We used seven SES measures to capture a wide variety of indicators of parental education and occupation as well as home possessions. An income measure was not used because TIMSS does not include such an item. We consider student and parent data, categorical and continuous information, and single items as well as a composite measure, as follows:

(1) Access to the Internet (dichotomous variable; student survey)
(2) Having one's own room (dichotomous variable; student survey)
(3) Number of books in the home (five ordered categories; student survey)
(4) Number of books in the home (five ordered categories; parent survey)
(5) Parental occupation (four ordered categories; parent survey)
(6) Parental education (five ordered categories; parent survey)

The seventh SES variable is a composite measure that combines five of the previously mentioned indicators, (1), (2), (3), (5), and (6), plus the number of children's books in the home (five ordered categories; parent survey). Martin, Mullis, and Hooper (2016, p. 15.33) provide detailed information on how item response theory was used to compute the continuous scale:

(7) Home resources for learning (HRL; continuous scale; parent and student surveys)

The pooled international data from all countries contain 19–27% missing data for the parent items and 2–3% for the student items. For 20% of the students, no HRL score is available. The difference in missing data between student and parent items points to a practical issue for the measurement of SES in ILSAs. Student data are typically collected in the classroom, while parents fill in the questionnaires at home. For this reason, the amount of missing data tends to be much higher for parent data. In some countries and studies, the nonresponse rate in the parent surveys is well above 50%, which reduces the effective sample size and may also introduce bias if the parent data are not missing at random.
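
The contrast between student- and parent-reported items can be made concrete with a toy computation. Everything below is invented for illustration (item names, category codings, missingness rates), and the simple mean of standardized items is only a naive stand-in for the IRT-based HRL scale described by Martin, Mullis, and Hooper (2016).

```python
# Toy data mimicking the missingness pattern described in the text.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1_000
df = pd.DataFrame({
    "internet":     rng.choice([0, 1, np.nan], n, p=[0.20, 0.78, 0.02]),  # student survey
    "own_room":     rng.choice([0, 1, np.nan], n, p=[0.30, 0.67, 0.03]),  # student survey
    "books_parent": rng.choice([1, 2, 3, 4, 5, np.nan], n,
                               p=[0.10, 0.15, 0.20, 0.20, 0.10, 0.25]),   # parent survey
    "educ_parent":  rng.choice([1, 2, 3, 4, 5, np.nan], n,
                               p=[0.10, 0.15, 0.20, 0.20, 0.12, 0.23]),   # parent survey
})

print(df.isna().mean().round(2))        # item-level missing rates, student vs. parent items

# Naive stand-in for a composite: mean of standardized items, requiring complete data.
z = (df - df.mean()) / df.std()
composite = z.mean(axis=1).where(df.notna().all(axis=1))
print("share of students without a composite score:", round(composite.isna().mean(), 2))
```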

Correlations Between the Different Measures of SES

How much do the SES variables correlate with each other? Note that we initially look only at the SES indicators themselves; the SES achievement gaps will be considered later. Table 2 shows the correlations between the SES measures at the student and the country level for the Grade 4 TIMSS data. For the sake of simplicity, in the country-level analyses we dichotomized the books variables (up to/more than 100 books), parental occupation (white/blue collar), and parental education (with/without tertiary education).

Table 2 Student- and country-level correlations between different SES measures

                                     (1)     (2)     (3)     (4)     (5)     (6)     (7)
(1) Possession: Internet (student)    –      0.49*   0.59*   0.66*   0.69*   0.65*   0.87*
(2) Possession: own room (student)   0.25*    –      0.34*   0.67*   0.29*   0.30*   0.62*
(3) Books at home (student)          0.32*   0.18*    –      0.74*   0.45*   0.46*   0.78*
(4) Books at home (parent)           0.34*   0.17*   0.59*    –      0.46*   0.42*   0.81*
(5) Parental occupation (parent)     0.36*   0.15*   0.36*   0.45*    –      0.87*   0.75*
(6) Parental education (parent)      0.39*   0.16*   0.41*   0.51*   0.67*    –      0.72*
(7) HRL (student and parent)         0.42*   0.30*   0.72*   0.67*   0.71*   0.75*    –

Note: Pooled international data from 46 countries; data sources are listed in parentheses (student and/or parent survey); at the student level (below the diagonal), polychoric correlations were computed for the correlations between the categorical variables (1–6), and the square root of the R² retrieved from one-way ANOVAs was used to measure the correlations between the continuous HRL scale (7) and the other SES indicators; at the country level (above the diagonal), Pearson's correlations were computed; * = statistically significant at the 5% level

We then used these variables to compute the share of students in each country who have access to the Internet, who have their own room, who have more than 100 books at home, and so forth. For the continuous HRL scale, we simply computed the country mean. The individual-level correlations are presented below the diagonal in Table 2. All variables correlate positively, but the strength of the correlations varies considerably. The HRL scale is composed of the individual items, and it is thus not surprising that the composite measure shows the highest correlations with the individual indicators. Further, the number of books at home reported by parents, parental occupation, and parental education are more highly correlated with each other than with the other measures. Among the student measures, access to the Internet, having one's own room, and the number of books at home are more loosely correlated. It is also worth mentioning that the number of books reported by students is relatively highly correlated with the HRL scale. The country-level correlations are presented above the diagonal in Table 2 and reveal interesting patterns. First, the correlations are higher, on average, at the country level than at the individual level. This difference can be explained at least in part by the fact that measurement errors are less consequential for aggregated data than for individual-level data. Second, the highest country-level correlation with the HRL scale is observed for the Internet access indicator; the share of students who have access to the Internet is thus the best proxy for the composite measure HRL at the country level. It should be noted, however, that there are some countries where hardly any students have access to the Internet and others where almost all students do. An SES indicator which is useful at the country level is not necessarily equally useful at the individual level. The decision of which SES indicators to use in research depends on various considerations. Under the assumption that SES is a latent unidimensional construct, it is useful to combine the information from different items to increase the validity and
reliability of the measures. From this perspective, the composite HRL scale is a particularly useful measure. However, our analyses also indicate that even single items such as the number of books at home (both student and parent reports), parental education, and parental occupation may be sufficiently highly correlated proxies of SES. In contrast to the book variable, Internet access and having one's own room are only weakly correlated with the composite measure at the individual level and are therefore insufficient proxies of SES.
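
A hedged sketch of the kind of two-level comparison reported in Table 2 is shown below: indicators are dichotomized, aggregated to country shares, and correlated at both levels. The data are simulated, and plain Pearson correlations are used throughout, whereas the chapter uses polychoric correlations and ANOVA-based coefficients at the student level.

```python
# Simulated illustration of student- vs. country-level correlations between SES indicators.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
rows = []
for c in range(46):                                   # 46 countries, as in the chapter
    n = 5_000
    latent = rng.normal(rng.normal(0, 0.5), 1, n)     # country-specific SES level
    rows.append(pd.DataFrame({
        "country": c,
        "internet": (latent + rng.normal(0, 1.5, n) > 0).astype(int),
        "books_100plus": (latent + rng.normal(0, 1.0, n) > 0.5).astype(int),
        "tertiary": (latent + rng.normal(0, 1.0, n) > 0.3).astype(int),
        "hrl": latent + rng.normal(0, 0.7, n),
    }))
df = pd.concat(rows)

country_level = df.groupby("country").mean()          # shares / means per country
print(country_level.corr().round(2))                  # country-level correlations
print(df.drop(columns="country").corr().round(2))     # pooled student-level correlations
```

In such simulations the country-level correlations are typically higher than the pooled individual-level ones, for the same aggregation reason noted in the text.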

Correlations Between the Different Measures of SES Inequality in Achievement

Does the degree of inequality in a country depend on which indicators are used to measure SES, or are countries generally more or less unequal regardless of the indicator used? To address this question, we computed different measures of SES inequality in achievement and estimated the correlations between them. Specifically, we first conducted a series of regression analyses in which we regressed mathematics achievement on one of the dichotomous, ordered categorical, or continuous SES indicators to compute the amount of variance (R²) each SES indicator explains in mathematics achievement. We repeated the analyses seven times for each country to obtain seven SES inequality measures per country. The possession items, access to the Internet and having one's own room, explain on average only about 3% of the variance in achievement; the two books variables, parental education, and parental occupation explain around 10%; and the HRL variable explains approximately 15%. However, the inequality measures vary across countries, and, in a second step, we computed the correlations between these measures, which are depicted in Table 3. We observe reasonably strong correlations between the measures of SES inequality in mathematics achievement that are based on parent data; these associations are higher than the associations between the SES indicators themselves presented above. For example, the inequality measures based on the parent survey correlate with each other at r = 0.8–0.9. Such high correlations suggest that different measures of SES inequality in mathematics achievement lead to a similar ranking of countries.

Table 3 Correlations between different measures of SES inequality in mathematics achievement

                                     (1)     (2)     (3)     (4)     (5)     (6)     (7)
(1) Possession: Internet (student)    –
(2) Possession: own room (student)   0.07     –
(3) Books at home (student)          0.29*   0.20     –
(4) Books at home (parent)           0.39*   0.10    0.81*    –
(5) Parental occupation (parent)     0.47*   0.08    0.77*   0.86*    –
(6) Parental education (parent)      0.56*   0.07    0.74*   0.81*   0.87*    –
(7) HRL (student and parent)         0.49*   0.21    0.88*   0.84*   0.85*   0.90*    –

Note: Pooled international data from 46 countries; data sources are listed in parentheses (student and/or parent survey); Pearson's correlations; * = statistically significant at the 5% level

Fig. 1 Plot of two measures of SES inequality in mathematics achievement

However, as Anscombe's (1973) quartet shows, numerical summaries of correlations can be misleading, and distributions can look quite different when graphed; outliers or clusters of data points, for example, can artificially produce high correlations. Figure 1 plots the achievement gap by parental education (x-axis) against the gap by the number of books reported by the parents (y-axis). The figure reveals that the high correlation is at least in part driven by the extreme values for Hungary, Slovakia, and Turkey. The association is weaker in the middle of the distribution; in Denmark, Italy, Cyprus, and Korea, for example, the achievement gap between children with up to 100 books and children with more than 100 books at home is at about the same level, while the gap between children of parents with and without higher education varies dramatically across these countries. From the perspective of a single country like Denmark or Korea, how SES has been operationalized therefore makes a considerable difference. With regard to the questions in the student survey, comparably high correlations with the inequality measures based on the parent survey can only be observed for the student
reported number of books variable. The two inequality measures based on the SES indicators access to the Internet and having one's own room correlate much less strongly with the measures based on the other items.
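
The construction of the inequality measures behind Table 3 can be sketched as follows: for each country, achievement is regressed on one SES indicator at a time, the explained variance (R²) is kept as the inequality measure, and these country-level measures are then correlated across indicators. The data below are simulated, and the code ignores plausible values, sampling weights, and clustering, which the operational analysis would not.

```python
# Simulated illustration of country-level R^2 inequality measures and their correlation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
records = []
for c in range(46):
    n = 4_000
    ses = rng.normal(0, 1, n)
    books = (ses + rng.normal(0, 1, n) > 0).astype(int)     # e.g., more than 100 books
    educ = (ses + rng.normal(0, 1, n) > 0.3).astype(int)    # e.g., tertiary education
    math = 500 + rng.uniform(20, 45) * ses + rng.normal(0, 85, n)
    records.append(pd.DataFrame({"country": c, "books": books, "educ": educ, "math": math}))
df = pd.concat(records)

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2                     # R^2 of a simple regression

ineq = df.groupby("country").apply(
    lambda g: pd.Series({"R2_books": r2(g["books"], g["math"]),
                         "R2_educ": r2(g["educ"], g["math"])}))
print(ineq.corr().round(2))        # do different measures rank countries alike?
```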

Standardization of Inequality Measures: Relative and Absolute Measures

In the pooled international data of TIMSS and other international assessments, the standard deviation of the achievement scale is 100 points, but it varies between countries. For example, the standard deviation was 104 points in Kuwait and 64 points in Korea. Figure 2 uses two different metrics to measure the achievement gap by parental education. The x-axis plots the absolute achievement gap, defined as the mean difference between children of parents with and without a university degree using the international metric of the achievement scores, and the y-axis shows the relative gap, which is standardized by dividing the absolute gap by the standard deviation in the respective country. The plot shows that the two measures are correlated, but the association is not perfect. For example, the achievement gap in both Kuwait and Korea corresponds to roughly 50 points when the international achievement scale is used as the metric, while the standardized achievement gaps suggest that SES inequality is much larger in Korea than in Kuwait.

Fig. 2 Absolute and relative achievement gaps by parental education

It should be noted that the previously used R² measure is another approach to standardizing measures of SES inequality, because here the proportion of explained variance is reported, which can take values between zero and one, independent of how much variance exists within countries. Further, it can be useful to standardize not only achievement but also SES to ease the interpretation of associational measures. For example, it is typically easier to interpret the correlation between two continuous variables than their covariance. In many cases, however, it is difficult to interpret transformed variables. For example, the comparison of blue- versus white-collar workers is an easy-to-communicate measure of occupation, but such a comparison has different meanings in developing and advanced economies.

The variation in achievement is a natural limit for SES achievement gaps: if there is no variation within a country, there cannot be an SES achievement gap. To illustrate this, Fig. 3 plots the SES gaps against the standard deviation of the achievement scores. In both panels the y-axis shows the standard deviation of the achievement scores within countries. The left panel shows absolute SES gaps based on the original TIMSS scale, while relative gaps are presented in the right panel. Note that the absolute and relative gaps are the same as those used in Fig. 2, except that they are now both plotted on the x-axis.

Fig. 3 Association between the absolute and relative gaps with the overall variation in achievement

The comparison shows that absolute SES performance gaps are larger in those countries where the standard deviation of performance is also large (r = 0.56). This association vanishes in the right panel with the standardized relative SES gaps (r = 0.13). The comparison reveals that absolute SES gaps are not only affected by the difference between SES groups, but
also by the overall variation in the achievement scores. The relative measure may be conceived of as a purer measure of SES inequality that is not affected by the overall variation in achievement. Different measures of SES inequality are best understood and interpreted in context, and whether one should use absolute or relative inequality measures cannot be answered unequivocally. Advantages of using the original achievement scales from TIMSS and other international assessments are that they are well documented and established in the research community. The study reports provide detailed information on the mathematics content that students master at certain levels of the achievement scale. Further, it is well documented that the learning progress of an additional school year toward the end of primary school corresponds to roughly 60 points in TIMSS (Luyten, 2006; Strietholt et al., 2013). In the same vein, studies like PIRLS, PISA, and TIMSS have now been conducted for 20 or more years, and
researchers have developed a reasonably good understanding of how the performance within countries changes from one cycle to another. This kind of interpretability of performance scores is limited as soon as the values are standardized.

Concluding Remarks

Among the most salient findings of international comparative large-scale studies on student achievement are the large SES gaps. These findings have been replicated in several studies (Volante et al., 2019), and an increasing number of studies investigate the institutional determinants that moderate the association between SES and achievement (Strietholt et al., 2019). Studying SES inequality from a comparative perspective using ILSA data has at least two methodological advantages. First, many institutional features of educational systems do not vary within a single country (e.g., the existence of central exit exams), so that international comparative studies are the only approach to observing variation in such features. Second, selection mechanisms within educational systems make it difficult to study socioeconomic inequality within a single country. For example, in a tracked school system socially advantaged students are often overrepresented in higher tracks, while disadvantaged children are overrepresented in lower tracks. Analyses at the country level avoid such selection bias and provide a more complete picture of the degree of SES inequality within a country.

At the same time, there are several conceptual and methodological challenges for researchers when describing and investigating SES inequality using data from international school achievement studies. First, different indicators are being used to measure SES. On the one hand, inequality measures based on parental education, parental occupation, home possessions, and composite measures that combine different indicators are fairly highly correlated (r = 0.8 and higher); on the other hand, the ranking of individual countries is frequently quite different for different indicators. In particular, national policymakers who are largely interested in comparing their own countries with others are well advised to consider which SES indicator(s) they regard as relevant.

Second, it is impossible to make general recommendations as to which indicators should be used to measure inequality. International classification schemes such as the ISEI and ISCED have been developed to compare occupations and educational degrees internationally, but the cross-cultural validity of these measures is not perfect. Home possessions are a much-needed proxy for income, but it is extremely difficult to identify possessions that function similarly in poor and rich countries, as well as in urban and rural areas. For example, there is little variation in access to the Internet within rich and poor countries, in that either almost everyone or almost no one has access. The challenge of finding suitable indicators for ILSAs is also reflected in the constantly changing home possessions scales. In studies like PIRLS, PISA, and TIMSS there are hardly any items that have been administered continuously across multiple study cycles. An exception is the well-established variable on the number of books at home, which is administered in all ILSAs. Although this variable is not perfect either, it has important advantages. There is a high face
validity since books are important for education, and the variable has variation across a large range of values (there are typically about five categories, e.g., 0–10 books, 11–25 books, etc.). In contrast, the number of cars in the household, for example, can in theory take any positive value, but in practice there are hardly any families that have more than two cars.

Third, SES is a moving target. Parental education, parental occupation, and family income have been, and will probably remain, important characteristics for the educational careers of children. However, social and economic systems change over time, and this change has consequences for the study and analysis of SES. Graduation rates in higher education have risen considerably over the last 100 years in many countries; new professions have emerged while others have lost importance. Home possessions that were good indicators of wealth some years ago are now accessible to a wide range of people, and digitalization is replacing printed books with eBooks. SES research must achieve a balance between continuity and change, ensuring comparability over time while making necessary adjustments. We do not want to be misunderstood here; far too often, items and instruments are changed only to return to the original version a few years later. Changes must not be an end in themselves; they are only legitimate if they bring substantial improvements.

Fourth, this chapter focused on socioeconomic status, but there are clearly other social categories, such as gender and race. In order to investigate educational justice, it is probably insufficient to look only at SES. In the section "Socioeconomic Inequality: Implicit Assumptions" we raised the question of whether we should be concerned about SES gaps, or rather about children who have been left behind. Of course, the socially disadvantaged are more often left behind, but what really matters here is not the difference between social groups, but the fact that some students are left behind.

Cross-References

▶ Family Socioeconomic and Migration Background Mitigating Educational-Relevant Inequalities
▶ Perspectives on Equity: Inputs Versus Outputs
▶ Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness

References

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21.
Atkinson, A. B., & Bourguignon, F. (Eds.). (2000). Handbook of income distribution (Vol. I). North-Holland.
Atkinson, A. B., & Bourguignon, F. (Eds.). (2015). Handbook of income distribution (Vol. II). North-Holland.
Beaton, A. E., Mullis, I. V. S., Martin, M. O., Gonzales, E. J., Kelly, D. L., & Smith, T. A. (1996). Mathematics achievement in the middle school years: IEA's third international mathematics and science study. Center for the Study of Testing, Evaluation, and Educational Policy, Boston College.
Brese, F., & Mirazchiyski, P. (2013). Measuring students' family background in large-scale international education studies. IEA–ETS Research Institute.
Brighouse, H., Ladd, H. F., Loeb, S., & Swift, A. (2018). Educational goods. Values, evidence, and decision-making. The University of Chicago Press.
Duncan, O. D., Featherman, D. L., & Duncan, B. (1972). Socio-economic background and achievement. Seminar Press.
Engzell, P. (2019). What do books in the home proxy for? A cautionary tale. Sociological Methods & Research, 50(4), 1487–1514. https://doi.org/10.1177/0049124119826143
Ganzeboom, H. B. G., De Graaf, P. M., & Treiman, D. J. (1992). A standard international socioeconomic index of occupational status. Social Science Research, 21(1), 1–56.
Gottfried, A. (1985). Measures of socioeconomic status in child development research: Data and recommendations. Merrill-Palmer Quarterly, 31(1), 85–92.
Hanushek, E. A., & Woessmann, L. (2011). The economics of international differences in educational achievement. In E. A. Hanushek, S. Machin, & L. Wößmann (Eds.), Handbook of the economics of education (Vol. 3). Elsevier.
Hauser, R. M. (1994). Measuring socioeconomic status in studies of child development. Child Development, 65(6), 1541–1545.
Heisig, J. P., Elbers, B., & Solga, H. (2019). Cross-national differences in social background effects on educational attainment and achievement: Absolute vs. relative inequalities and the role of education systems. Compare, 50(2), 165–184. https://doi.org/10.1080/03057925.2019.1677455
Jerrim, J., Volante, L., Klinger, D. A., & Schnepf, S. V. (2019). Socioeconomic inequality and student outcomes across education systems. In Socioeconomic inequality and student outcomes (pp. 3–16). Springer.
Luyten, H. (2006). An empirical assessment of the absolute effect of schooling: Regression-discontinuity applied to TIMSS-95. Oxford Review of Education, 32(3), 397–429.
Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016). Methods and procedures in TIMSS 2015. TIMSS & PIRLS International Study Center, Boston College.
Moore, J. C., Stinson, L. L., & Welniak, E. J., Jr. (2000). Income measurement error in surveys: A review. Journal of Official Statistics, 16(4), 331–361.
Mueller, C. W., & Parcel, T. L. (1981). Measures of socioeconomic status: Alternatives and recommendations. Child Development, 52(1). https://doi.org/10.2307/1129211
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. TIMSS & PIRLS International Study Center, Boston College.
Mullis, I. V. S., Cotter, K. E., Centurino, V. A. S., Fishbein, B. G., & Liu, J. (2016). Using scale anchoring to interpret the TIMSS 2015 achievement scales. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in TIMSS 2015 (pp. 14.1–14.47). TIMSS & PIRLS International Study Center, Boston College. http://timss.bc.edu/publications/timss/2015-methods/chapter-14.html
Nozick, R. (1974). Anarchy, state and Utopia. Basic Books.
Robeyns, I. (2006). Three models of education: Rights, capabilities and human capital. Theory and Research in Education, 4(1), 69–84. https://doi.org/10.1177/1477878506060683
Rutkowski, D., & Rutkowski, L. (2013). Measuring socioeconomic background in PISA: One size might not fit all. Research in Comparative and International Education, 8(3), 259–278. https://doi.org/10.2304/rcie.2013.8.3.259
Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417–453. https://doi.org/10.3102/00346543075003417
Strietholt, R. (2014). Studying educational inequality: Reintroducing normative notions. In R. Strietholt, W. Bos, J.-E. Gustafsson, & M. Rosén (Eds.), Educational policy evaluation through international comparative assessments (pp. 51–58). Waxmann.
Strietholt, R., Rosén, M., & Bos, W. (2013). A correction model for differences in the sample compositions: The degree of comparability as a function of age and schooling. Large-scale Assessments in Education, 1(1), 1–20. https://doi.org/10.1186/2196-0739-1-1
Strietholt, R., Gustafsson, J. E., Hogrebe, N., Rolfe, V., Rosén, M., Steinmann, I., & Yang-Hansen, K. (2019). The impact of education policies on socioeconomic inequality in student achievement: A review of comparative studies. In L. Volante, S. V. Schnepf, J. Jerrim, & D. A. Klinger (Eds.), Socioeconomic inequality and student outcomes. Springer.
UNESCO. (1997). ISCED 1997: International Standard Classification of Education. UNESCO Institute for Statistics.
Van Doorslaer, E., & Van Ourti, T. (2011). Measuring inequality and inequity in health and health care. In S. Glied & P. C. Smith (Eds.), The Oxford handbook of health economics (pp. 837–869). Oxford University Press.
Volante, L., Schnepf, S. V., Jerrim, J., & Klinger, D. A. (2019). Socioeconomic inequality and student outcomes. Cross-national trends, policies, and practices. Springer.
Watermann, R., Maaz, K., Bayer, S., & Roczen, N. (2016). Social background. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts of learning (pp. 117–145). Springer.
White, K. R. (1982). The relation between socioeconomic status and academic achievement. Psychological Bulletin, 91(3), 461–481. https://doi.org/10.1037/0033-2909.91.3.461
Ye, W., Strietholt, R., & Blömeke, S. (2021). Academic resilience: Underlying norms and validity of definitions. Educational Assessment, Evaluation and Accountability, 33(1), 169–202. https://doi.org/10.1007/s11092-020-09351-7

11 Measures of Opportunity to Learn Mathematics in PISA and TIMSS: Can We Be Sure that They Measure What They Are Supposed to Measure?

Hans Luyten and Jaap Scheerens

Contents
Research Problem
Previous Research on OTL Effects
Conceptual Framework
The Validity of OTL Measures
Research Approach and Research Questions
Method
Assessment of Content Validity
Assessment of Convergent Validity
Results
Content Validity: Comparison of the Content of the TIMSS and PISA OTL Measures
Convergent Validity: The Association of OTL with Student Performance in TIMSS and PISA
Discussion
Conclusion
References

Abstract

This chapter presents assessments of the content and convergent validity of the opportunity to learn (OTL) measures in the PISA and TIMSS surveys. Assessments of content validity are based on the correspondence between the OTL measures and the frameworks that guided the development of the mathematics tests in both surveys. Conclusions with regard to convergent validity are based on the statistical association between OTL and mathematics achievement. The chapter points to remarkable differences between PISA and TIMSS in the way OTL is measured. The content validity of the OTL measure in TIMSS appears to be more credible, but the PISA measure shows a much stronger association with mathematics achievement.

H. Luyten · J. Scheerens (*)
University of Twente, Enschede, The Netherlands
e-mail: [email protected]; [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_12

Keywords

Opportunity to learn · TIMSS · PISA · Mathematics achievement · Content validity · Convergent validity

Research Problem

The concept of opportunity to learn, abbreviated as OTL, is commonly used to compare the content covered, as part of the implemented curriculum, with measures of student achievement. Opportunity to learn is generally considered a characteristic of effective education (Carroll, 1963, 1989; Mullis et al., 2012; OECD, 2014a). In the cross-national TIMSS surveys (Trends in International Mathematics and Science Study), which have been conducted on a 4-year basis since 1995, the association between the content taught and the content tested has always been a major topic of interest. The cross-national PISA surveys (Programme for International Student Assessment), which have been conducted on a 3-year basis since 2000, on the other hand, focus primarily on the assessment of cognitive skills that young people need to succeed later in life. Therefore, information on opportunity to learn was not collected in the early PISA studies. In 2012 this changed, and data on the content taught, in relation to the content tested, was collected for the first time with regard to mathematics. One of the more recently developed OTL measures in PISA shows a remarkably strong correlation with student achievement (OECD, 2014a; Schmidt et al., 2015).

The commonsense logic that educational effectiveness is served by a good correspondence between what is taught and what is tested seems straightforward enough. Yet conceptual analyses point at considerable complexity, and empirical research studies show diverging outcomes (Kurz et al., 2010; Luyten, 2017). In the analytic part of this chapter we highlight the conceptual framework for OTL and take a glance at the state of the art of empirical research, in terms of outcomes and methodological approaches. The conceptual framework and research review are needed to provide an "action theory" of OTL (Kane, 2008). This serves as the background for the analytic and empirical contents of this chapter, which is framed in terms of assessing facets of the validity of OTL measures. The empirical material is drawn from two international assessment studies, TIMSS 2011 (Mullis et al., 2012) and PISA 2012 (OECD, 2014a), and the validity facets that are addressed are content and convergent validity.

The relevance of furthering the knowledge base about OTL is that it is to be seen as one of the key concepts in educational effectiveness thinking and that it has high potential to be applied in the context of educational accountability and school improvement. The practical relevance of improving our substantive knowledge about OTL and its measurement lies in its potential to explain achievement differences in summative and formative assessment results, and in using OTL measures as a basis for feedback- and improvement-oriented action.

Previous Research on OTL Effects

Research studies that related opportunity to learn to student achievement have been incorporated in several meta-analyses of educational effectiveness research. A series of meta-analyses carried out in the 1970s and 1980s (Bruggencate et al., 1986; Horn & Walberg, 1984; Husen, 1967; Pelgrum et al., 1983) showed average to large effect sizes, ranging from .45 to 1.36. In more recent meta-analyses (Hattie, 2009; Marzano, 2003; Scheerens & Bosker, 1997; Scheerens et al., 2007) the effect sizes that were found were more modest, ranging from .18 to .39. In these meta-analyses opportunity to learn was mostly operationalized as a school-level factor. Seidel and Shavelson (2007) addressed classroom-level teaching interpretations of OTL and found a very small effect of .04. Another meta-analysis that addressed OTL at the teaching level, by Kyriakides et al. (2013), reported an effect size of .18. Lamain et al. (2017) carried out a "vote count" analysis of 51 empirical studies conducted between 1995 and 2005; the share of studies that found a significant positive effect for OTL was 44%. Results from these meta-analyses and review studies show considerable variation in effect sizes, which should probably be partly attributed to a wide divergence between studies in the way OTL was operationalized and measured. A range of major primary research studies confirms this overall impression of strongly diverging outcomes in terms of effect sizes and heterogeneity in the empirical methods to assess OTL (Polikoff & Porter, 2014; Porter et al., 2011; Schmidt et al., 2001, 2011, 2015). The average effect size of OTL on student achievement that emerges from this literature is conservatively estimated at .30. In comparison to other frequently studied correlates of student achievement (such as educational leadership and cooperation between teachers) the OTL effect size is relatively large, although in absolute terms it is modest at best (Scheerens, 2017). This underlines the relevance of OTL and indicates that it is worthwhile to improve the consistency in operationally defining and measuring OTL.

Conceptual Framework

As a working definition to start with, we propose to define OTL as the matching of taught content with tested content. As such, OTL is to be seen as part of the larger concept of curriculum alignment in educational systems. When national educational systems are seen as multilevel structures, alignment is an issue at each specific level, but also an issue of connectivity between different layers. General
education goals or national standards are defined at the central level (the intended curriculum). At intermediary levels (between the central government and schools), curriculum development, textbook production, and test development take place. At the school level, school curricula or work plans may be used, and at classroom level, lesson plans and actual teaching are facets of the implemented, or enacted, curriculum. Test taking at the individual student level completes the picture (the realized curriculum). This process of gradual specification of curricula is the domain of curriculum research, with the important distinction between the intended, implemented, and realized curriculum as a core perspective. This perspective is mostly associated with a proactive logic of curriculum planning, as an approach that should guarantee a valid operationalization of educational standards into planning documents and their implementation in actual teaching.

Clarifying OTL within a context of curriculum alignment is further elaborated by Kurz et al. (2010), who use a similar deductive logic from the intended curriculum to the realized or assessed curriculum as described above. The increasingly specific curriculum forms they mention are:

– at system level: the general intended curriculum and the assessed curriculum (e.g., in national high-stakes achievement tests; the assessed curriculum for accountability purposes is designed at the system level in alignment with the intended curriculum);
– at teacher level: the planned and enacted curriculum;
– at student level: the engaged, learned, and displayed (or assessed) curriculum.

Within this overall framework the enacted curriculum plays a central role in the definition of OTL, which Kurz describes as the opportunity for students to learn the intended curriculum. It should be noted that within this hierarchical framework of curriculum specifications there are two kinds of informational sources: documents and artifacts on the one hand, and teacher and student behavior on the other. Alignment of content elements and desired cognitive operations between the different curriculum forms, tests, and assessment programs can be studied by means of document analyses. Alignment methodologies such as the Surveys of the Enacted Curriculum (SEC; Porter & Smithson, 2001) determine the match between the content elements and cognitive process expectations of two curriculum forms, e.g., between the intended curriculum and the content of a national assessment test, which is subsequently expressed in an "alignment index" (AI; Polikoff & Porter, 2014). The enacted curriculum as taught by teachers and the displayed or assessed curriculum as manifested by students depend on behavioral observation, self-reports, and test results. Within the multilevel conceptual framework of curriculum alignment, some associations do not just depend on the alignment of different planning documents or scripts for behavior, but also on a behavioral check of the degree to which these are actually implemented. For expectations about the "tightness of coupling" between more general and more specific curriculum forms, it is therefore important to see that some of the associations depend both on the alignment of planning documents and scripts and on their behavioral enactment. For the most common interpretation of OTL, the association between the curriculum as enacted by teachers and the assessed curriculum (student test scores), various types of misalignment might be the source of low correlations: misalignment between the planned curriculum
and the content of the assessment, the lack of implementation of the planned curriculum by the teacher, and suboptimal learning to master the content by students.

The emerging conceptual framework of OTL has the following characteristics:

– A multilevel framework of educational systems (central government, intermediary organizational structures, schools, and classrooms)
– A hierarchy of curricular elements (standards, national curricula, textbooks, school curricula, actual teaching and learning of content at specific levels of cognitive complexity, test contents, and test results)
– Seeing the dynamics of the functioning of these hierarchical systems under the label of alignment, which holds the expectation that tight coupling between levels and curricular elements enhances educational performance
– Recognition that some manifestations of alignment can be assessed in terms of deductive logic (lower-level curricular elements should cover higher, more abstract elements) and others as manifestations of the implementation of plans into educational behavior
– Recognition that alignment relationships will tend to be suboptimal, as a result of imperfect implementation or even deliberate choices not to try and operationalize all facets of general educational goals

This last characteristic is quite problematic, as it would upset a straightforward "linear" optimization strategy and call for more "conditional" types of optimization (e.g., deliberately leaving out applied skills from summative tests or examinations but seeking full coverage of all other intended competencies). Another issue that may add to the complexity of enhancing alignment and OTL is the question to what extent the alignment process is seen as closely managed and monitored, or left to the free play of independent organizational units. Interestingly, educational systems sometimes combine ambitions of standardization and control on the one hand with the stimulation of autonomy (of schools and teachers) on the other.

With respect to OTL, measured as the enacted curriculum, some authors argue for enlarging the scope of the concept. The basic definition of OTL refers to educational content. Further elaboration of this basic orientation considers qualitatively different cognitive operations in association with each content element, often expressing ascending levels of cognitive complexity. In TIMSS, for example, items are categorized with regard to content but also with regard to cognitive domains (knowing, applying, reasoning; Mullis et al., 2009). A next step in enlarging the scope is to add an indication of the time students were exposed to the specific content elements. Sometimes the theoretical option to include quality of delivery in the OTL rating is considered as well. Kurz et al. (2010) present a comprehensive conceptualization of OTL with content, time, and quality as main facets. Adding the dimension of quality of instructional delivery stretches the OTL definition to a degree that it approaches a measure of overall instructional quality. In this chapter a less comprehensive definition of OTL is used, which refers to coverage of educational objectives and leaves aside quality of instruction and amount of time invested.
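The alignment index (AI) produced by SEC-type methodologies, as referred to above, is commonly computed from the proportions of content falling in each cell of a topic-by-cognitive-demand matrix for the two curriculum forms being compared. A standard formulation (our rendering of the index described by Porter and colleagues, not a formula reproduced from this chapter) is:

```latex
\mathrm{AI} \;=\; 1 \;-\; \frac{\sum_{i} \lvert x_{i} - y_{i} \rvert}{2}
```

where x_i and y_i are the cell proportions (each summing to 1) for the two documents; AI equals 1 for identical content distributions and 0 when there is no overlap at all.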


The Validity of OTL Measures

"The concept of validity has undergone major changes over the years. Most prominently, it has moved from being rather narrow and evidence based to becoming complex and broad, including also the consequences of instruments and assessment procedures" (Wolming & Wikström, 2010, p. 117). From their overview, four major phases in the development of validity theory can be discerned. In the first phase, "everything a test correlated with" was regarded as evidence for its validity (with reference to Guilford, 1946). In a next phase the validity construct was further diversified into several "types" of validity, depending on the purpose of the test. Content validity was used for tests describing an individual's performance on a defined subject. Criterion-related validity was used for tests predicting future performance. Construct validity was used to make inferences about psychological traits like intelligence or personality (Wolming & Wikström, 2010, p. 118). In a third phase construct validity was seen as the overarching construct, where the more specific types of validity referred to above were all seen as contributing to an overall validity assessment. Messick (1995, p. 13) described construct validity as an "integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores" (Wolming & Wikström, 2010). A further broadening in the scope of validity theory came from contributions by Kane (1992, 2006), who proposed an "argumentative approach" to validity. Validity was regarded as an overall assessment of a "theory of action" in which test application is situated. An additional element is "consequential validity," which encompasses assessment of the utility of tests and their social consequences. In educational applications, the high-stakes nature of some test applications fits the idea of an action theory and its consequential implications, in the sense of desired effects and undesired side effects for actors (teachers and students). The alignment model presented in the previous section can be interpreted as a theory of action, with student achievement as the ultimate criterion and a complex chain of intermediary elements, and associations between them, as instrumental to measures of educational performance.

The broadening of the scope of validity theory has met with criticism by authors who argued for going back to basics and returning to narrower validity concepts (Mehrens, 1997, cited by Wolming & Wikström, 2010, p. 117). These latter authors conclude that they had not found practical examples of fully fledged application of the broad interpretative approach proposed by Kane. In the exemplary studies they describe, the more conventional validity types (content, concurrent, and criterion validity) remain the major elements.

In this study we concentrate on content validity and a specific aspect of construct validity of OTL, namely convergent validity. Content validity involves the degree to which the content of the test matches a content domain associated with the construct (cf. Carmines & Zeller, 1979). In the case of OTL this means that taught content is studied for its match with intended and tested content. We address construct validity from the perspective of the theory of action presented in our alignment model. This means that the association of OTL measures of enacted teaching with student achievement is seen as the touchstone of their validity. This specific subtype of construct validity is frequently referred to as convergent validity (Messick, 1995). To the degree that we tentatively make conjectures based on the comparison between OTL measures in TIMSS and PISA, we will also be addressing concurrent validity (the degree to which an operationalization correlates with other measures of the same construct). (We also considered labeling the association between OTL and math performance as an instance of "criterion" validity, with performance seen as an external criterion, but ultimately did not opt for this, as it could be argued that the criterion variable should measure the same concept as the "predictor" or independent variable. Convergent validity better fits the reference to a substantive theory of OTL, as is evident in construct validity.)

Research Approach and Research Questions

The data on OTL collected in PISA and TIMSS represents an unparalleled amount of statistical information on test-curriculum overlap, as it relates to vast numbers of schools, teachers, and students in numerous countries. The present chapter addresses the content validity and the convergent validity of the OTL measures in PISA and TIMSS with regard to mathematics in secondary education (based on the definitions of content and construct validity in Carmines & Zeller, 1979). Further knowledge in this respect is highly relevant, as OTL is generally considered to be an important aspect of effective education, and the TIMSS and PISA datasets are widely recognized as important sources of information for academics, policy makers, and the general public on the knowledge and skills of students across the globe. For example, based on analyses that involve the OTL measure in the PISA dataset, Schmidt et al. (2015) conclude that OTL accounts for approximately one third of the relationship between the socioeconomic background (SES) of students and mathematics achievement, and also that exposure to content may exacerbate such inequalities. Clearly, the credibility of such a conclusion hinges on the validity of the measures used to express the key variables in the analysis. In this chapter, it will be shown that the OTL measures in PISA and TIMSS vary in several key respects and that they lead to different conclusions about the amount of content covered per country (see Fig. 3 below, under "Key Findings with Regard to Convergent Validity"). The relevance of OTL as a facet of effective education is widely recognized, but consensus on its most appropriate operationalization (in large-scale educational surveys) has yet to be achieved. In order to advance the development of theory and research on educational effectiveness, consensus on the conceptualization and measurement of key explanatory variables is of fundamental importance.

The current chapter focuses on OTL as a school-level characteristic. OTL can also be viewed as a system-level characteristic and may as such be able to account for differences in student achievement between countries. It can also be conceived as a classroom characteristic, as teachers within a school may differ in the topics they
cover. Finally, teachers may choose to vary the content taught between students within their classrooms (e.g., teaching enriched or advanced content to the most talented students). However, questions with regard to OTL at either the individual, class, or system level are beyond the scope of the current chapter.

Appendix: Number of math teachers per school in TIMSS; number of students per school in PISA

| Country | Schools (TIMSS 2011) | Teachers (TIMSS 2011) | Teachers per school | Schools (PISA 2012) | Students (PISA 2012) | Students per school |
| Australia | 227 | 802 | 3.53 | 775 | 14481 | 18.7 |
| Chile | 184 | 194 | 1.05 | 221 | 6856 | 31.0 |
| Finland | 142 | 264 | 1.86 | 311 | 8829 | 28.4 |
| Hong Kong | 115 | 148 | 1.29 | 148 | 4670 | 31.6 |
| Hungary | 145 | 280 | 1.93 | 204 | 4810 | 23.6 |
| Indonesia | 153 | 170 | 1.11 | 209 | 5622 | 26.9 |
| Israel | 151 | 514 | 3.40 | 172 | 5055 | 29.4 |
| Italy | 188 | 205 | 1.09 | 1194 | 31073 | 26.0 |
| Japan | 138 | 181 | 1.31 | 191 | 6351 | 33.3 |
| Korea (South) | 148 | 376 | 2.54 | 156 | 5033 | 32.3 |
| Lithuania | 141 | 262 | 1.86 | 216 | 4618 | 21.4 |
| Malaysia | 180 | 180 | 1.00 | 164 | 5197 | 31.7 |
| Norway | 132 | 175 | 1.33 | 197 | 4686 | 23.8 |
| New Zealand | 154 | 354 | 2.30 | 177 | 4291 | 24.2 |
| Qatar | 107 | 195 | 1.82 | 157 | 10966 | 69.8 |
| Romania | 146 | 248 | 1.70 | 178 | 5074 | 28.5 |
| Singapore | 165 | 330 | 2.00 | 172 | 5546 | 32.2 |
| Slovenia | 181 | 523 | 2.89 | 338 | 5911 | 17.5 |
| Sweden | 142 | 405 | 2.85 | 209 | 4736 | 22.7 |
| Thailand | 171 | 172 | 1.01 | 239 | 6606 | 27.6 |
| Tunisia | 205 | 207 | 1.01 | 153 | 4407 | 28.8 |
| Turkey | 238 | 240 | 1.01 | 170 | 4848 | 28.5 |
| Taiwan | 150 | 162 | 1.08 | 163 | 6046 | 37.1 |
| United Arab Emirates | 430 | 603 | 1.40 | 458 | 11500 | 25.1 |
| United States of America | 384 | 559 | 1.46 | 162 | 4978 | 30.7 |
| Average across countries | 180.7 | 310.0 | 1.75 | 269.4 | 7287.6 | 29.2 |
| Median across countries | 153.0 | 248.0 | 1.46 | 191.0 | 5197.0 | 28.5 |
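The per-school ratios and the summary rows in the appendix follow directly from the raw school, teacher, and student counts. A minimal sketch of that bookkeeping (using a few rows from the appendix as sample data; the data frame and column names are our own, not taken from the TIMSS or PISA files) is:

```python
import pandas as pd

# Sample rows from the appendix; the full computation would use all 25 countries.
timss = pd.DataFrame({
    "country": ["Australia", "Chile", "Malaysia"],
    "schools": [227, 184, 180],
    "teachers": [802, 194, 180],
})
pisa = pd.DataFrame({
    "country": ["Australia", "Chile", "Malaysia"],
    "schools": [775, 221, 164],
    "students": [14481, 6856, 5197],
})

# Per-school ratios as reported in the appendix.
timss["teachers_per_school"] = (timss["teachers"] / timss["schools"]).round(2)
pisa["students_per_school"] = (pisa["students"] / pisa["schools"]).round(1)

# Summary rows: unweighted average and median across countries.
print(timss["teachers_per_school"].mean(), timss["teachers_per_school"].median())
print(pisa["students_per_school"].mean(), pisa["students_per_school"].median())
```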


Method

Assessment of the content and convergent validity of the OTL measures in TIMSS and PISA is based on a prior study regarding the predictive power of OTL measures (Luyten, 2017). The focus is on 25 countries that participated in both PISA 2012 and TIMSS 2011 with eighth-grade student populations. The analyses relate to math achievement in secondary education, so that the student populations (in PISA: 15-year-olds in 2012; in TIMSS: eighth-grade students in 2011), the outcome measure (mathematics achievement), and a vital control variable (number of books at home) are highly similar in PISA and TIMSS. It is also important to note that this chapter focuses on OTL at the school level. Other perspectives (system level, class level, individual level) are not addressed in the current chapter.

Assessment of Content Validity

Assessing the content validity of an OTL measure boils down to the question whether the essential facets of the concept are covered by the measure. As OTL relates to the alignment of the implemented curriculum with educational outcomes, a valid measure of OTL should cover the same topics as the ones covered by the measure of student achievement. In order to assess the content validity of the OTL measures featuring in the TIMSS and PISA reports (Mullis et al., 2012; OECD, 2014a), both OTL measures will be described in more detail. The facets identified in the review of OTL measures will serve as a framework for the descriptions, which will point to the differences and similarities between TIMSS and PISA with regard to the operationalization of OTL. Next, the frameworks that guided the development of the mathematics tests in both surveys will be described as well. Conclusions with regard to the content validity of both OTL measures will be based on the correspondence between the OTL measures and the frameworks that guided the development of the mathematics tests in TIMSS and PISA.
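One simple way to make this kind of correspondence check concrete is to compute the share of framework topics that an OTL instrument addresses. The snippet below is only an illustration of that idea (it is not the authors' procedure, and the topic labels are hypothetical placeholders; the actual comparison is shown in Table 2):

```python
# Hypothetical topic labels for illustration only.
framework_topics = {
    "whole numbers", "fractions and decimals", "integers",
    "ratio, proportion, and percent", "patterns",
    "algebraic expressions", "equations/formulas and functions",
}
otl_questionnaire_topics = {
    "whole numbers", "fractions and decimals", "integers",
    "ratio, proportion, and percent", "algebraic expressions",
}

# Proportion of framework topics addressed by the OTL measure.
coverage = len(framework_topics & otl_questionnaire_topics) / len(framework_topics)
print(f"framework topics covered: {coverage:.0%}")
```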

Assessment of Convergent Validity

The assessment of the convergent validity will be based on the statistical relations between the OTL measures in TIMSS and PISA and math achievement in 25 different countries. Strong associations between OTL and achievement are considered as evidence that supports its validity, as OTL is believed to enhance educational effectiveness. Through focusing on math achievement in secondary education, the student populations, the outcome measure, and the control variable (number of books at home) are highly similar in TIMSS and PISA. Thus, differences in the statistical association between OTL and achievement in PISA and TIMSS can be assessed, while minimizing the impact of additional factors. The analyses focus on school-level aggregates and do not address variation in
student achievement within schools. This is done with the intention to reach a "level playing field" when comparing the findings in TIMSS and PISA. In PISA, OTL is based on student reports. This implies that OTL scores may differ between students within schools and classes. In the TIMSS dataset, information on OTL is based on teacher reports. The exact numbers vary across schools and countries, but on average the information derives from one or two mathematics teachers per school (see the appendix for details). In PISA, it is not possible to link students to teachers or classes. The PISA data only indicate to which school a student belongs; information on class membership is not available. Therefore, it was decided to conduct both the analyses of the TIMSS data and those of the PISA data on school-level aggregates, with schools weighted by the number of students they represent. The sampling designs of both surveys are somewhat different. PISA selects a sample of approximately 35 respondents out of all students in a school in the intended age range (15-year-olds). TIMSS, on the other hand, includes all students of one or two classrooms within the sampled schools. In both surveys, however, the OTL measures relate to the content covered in the instruction provided to the students that took the mathematics test.

The analyses control for student background, operationalized through the number of books at home. The relation between OTL and student achievement is assessed by means of a number of regression analyses, in which the number of books at home is included as a control variable (the school-level model is written out right after the list below). Prior analyses of PISA 2012 data have revealed a substantial correlation between OTL and family background (Schmidt et al., 2015). Raw correlations between achievement and OTL may therefore be confounded, due to the joint relation of OTL and achievement with family background. Unfortunately, there is little overlap in the questions on family background in the TIMSS and PISA questionnaires, but the question on the number of books at home is nearly identical in both surveys. The only difference is an additional sixth response category in PISA. In order to realize maximum comparability, the fifth and sixth response categories in PISA have been collapsed into a single category. The resulting categories are:

1. 0–10 books
2. 11–25 books
3. 26–100 books
4. 101–200 books
5. more than 200 books
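The regression described above can be written compactly. In our notation (not taken verbatim from the chapter), for schools s within a given country,

```latex
\overline{\mathrm{MATH}}_{s} \;=\; \beta_{0} \;+\; \beta_{1}\,\overline{\mathrm{OTL}}_{s} \;+\; \beta_{2}\,\overline{\mathrm{BOOKS}}_{s} \;+\; \varepsilon_{s},
```

where the bars denote school-level means (with schools weighted by the number of students they represent) and the variables are standardized, so that β1 and β2 correspond to the standardized coefficients reported for OTL and books at home in Table 6.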

As all data are aggregated at the school level, the findings cannot be interpreted as estimated effects of books at home on individual achievement. They reflect the relation between the average background of the school population and average achievement per school. A positive coefficient does not necessarily imply that students with many books at home tend to get high test scores (Robinson, 1950). It is even conceivable that in schools with high numbers of books at home on average, the students with low numbers of books at home get high scores. This may not seem a realistic scenario, but in voting studies researchers may find that

support for anti-immigrant policies is relatively strong in districts with high percentages of immigrants. In such cases, it is obvious that the relation between immigrant status and political sympathies at the individual level is quite different from the relation at the aggregate level.

All analyses made use of "plausible values." In both the TIMSS and PISA datasets, individual student performance is represented by five plausible values instead of a single test score. Plausible values are a useful way to deal with assessment situations where the number of items administered is not large enough for precise estimation of individual ability. Based on the observed score, a distribution is constructed of the student's ability. The plausible values are random draws from this distribution and represent the range of abilities students might reasonably have (Wu & Adams, 2002). The use of plausible values prevents researchers from underestimating random error (i.e., overestimating precision) in their findings. The results to be reported are thus based on analyses that were repeated five times. The coefficients represent the means across the five plausible values, and the corresponding standard errors were calculated in accordance with the guidelines specified in the PISA data analysis manual (OECD, 2009, pp. 100–115). Student weights (HOUWGT) were used to ensure that the results are representative for the student population. Appropriate estimation of the statistical significance of the obtained regression coefficients required an adjustment of the t-values, which are initially based on student numbers, as they should be based on the number of schools (not the number of students). The datasets that were used to conduct the analyses can be retrieved from http://www.oecd.org/pisa/ and https://timssandpirls.bc.edu/.
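A minimal sketch of this kind of estimation routine is given below. It assumes a hypothetical school-level data frame with columns pv1–pv5 (school-mean plausible values for mathematics), otl, books, and a student-count weight w; the variable names and the simple pooling of standard errors are our own simplification, not the exact procedure prescribed in the OECD manual (which relies on replicate weights):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def pooled_wls(df: pd.DataFrame, pvs=("pv1", "pv2", "pv3", "pv4", "pv5")):
    """Run the school-level regression once per plausible value and pool the results."""
    X = sm.add_constant(df[["otl", "books"]])
    coefs, variances = [], []
    for pv in pvs:
        fit = sm.WLS(df[pv], X, weights=df["w"]).fit()
        coefs.append(fit.params)
        variances.append(fit.bse ** 2)
    coefs, variances = pd.DataFrame(coefs), pd.DataFrame(variances)
    m = len(pvs)
    point = coefs.mean()                 # average coefficient over the plausible values
    within = variances.mean()            # average sampling variance
    between = coefs.var(ddof=1)          # variance of the coefficients across plausible values
    total_se = np.sqrt(within + (1 + 1 / m) * between)  # Rubin-style combination
    # Significance would then be judged against the number of schools,
    # not the (weighted) number of students.
    return point, total_se, point / total_se

# Usage (schools_df is a hypothetical per-country data frame of school aggregates):
# point, se, t = pooled_wls(schools_df)
```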

Results

Content Validity: Comparison of the Content of the TIMSS and PISA OTL Measures

OTL Measures in TIMSS and PISA
In TIMSS, the mathematics teachers were asked to indicate, for a range of topics within four mathematics domains (Number, Algebra, Geometry, Data and Chance), which of the following options best described the situation for the students in the TIMSS survey:

• The topic had been mostly taught before this year
• The topic had been mostly taught this year
• The topic had not yet been taught or had just been introduced

Examples of the topics included are concepts of fractions and computing with fractions (Number domain), simple linear equations and inequalities (Algebra domain), congruent figures and similar triangles (Geometry domain), and reading and displaying data using tables, pictographs, bar graphs, pie charts, and line graphs (Data and Chance domain). See Table 2 for an overview of the topics. Based on these responses, indices were constructed that express on a scale from 0 to 100 to what extent each of the four domains had been covered. For the purpose of the analyses reported in the current chapter, a general OTL measure was computed (the average of the OTL indices over the four domains; a brief computational sketch follows at the end of this subsection).

In PISA 2012, OTL data was obtained from the students instead of the teachers. Using the student responses, three indices on OTL with regard to mathematics were constructed. The analyses reported here focus on the experience with formal mathematics. The technical report (OECD, 2014b) discusses two additional OTL indices with regard to content covered. The first one relates to experience with applied mathematics (e.g., figuring out from a train schedule how long it would take to get from one place to another) and the second one to familiarity with a range of mathematical concepts (e.g., exponential functions). The main PISA 2012 report (OECD, 2014a) describes yet another OTL index (exposure to word problems), which also involves formal mathematics. In this case the mathematics is presented as a typical textbook problem, and it will be clear to students which mathematical knowledge and skills are needed. In the current chapter we focus on the first index, which captures experience with formal mathematics. Extensive empirical analyses showed that it has a remarkably strong relation with mathematics achievement (OECD, 2014a, pp. 145–174; Schmidt et al., 2015). The index is based on three questionnaire items. Students were asked to indicate for the following mathematics tasks how often they had encountered them during their time at school:

• Solving an equation like 6x² + 5 = 29
• Solving an equation like 2(x + 3) = (x + 3)(x − 3)
• Solving an equation like 3x + 5 = 17

The response categories were: frequently – sometimes – rarely – never. The PISA index was constructed by means of item response theory (IRT) scaling, which results in scores that may range from minus infinity to infinity with a zero mean (OECD, 2014b, p. 329).

The first striking difference between the way OTL is measured in TIMSS vs. PISA relates to the data source (teachers vs. students). Another difference is presented by the curricular units on which both OTL measures are based. Whereas OTL in TIMSS is based on coverage of mathematical content categories (e.g., linear equations), OTL in PISA is based on exposure to specific test items (e.g., solving an equation like 3x + 5 = 17). Besides these differences, some similarities between PISA and TIMSS should be noted as well. In both cases the focus is on mathematical content, whereas cognitive operations (e.g., knowing, applying, and reasoning) are not addressed. Furthermore, both measures focus on exposure rather than alignment. Table 1 provides an overview of the differences and similarities between PISA and TIMSS with regard to the operationalization of OTL.
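As a rough illustration of how such a teacher-reported coverage index and the general OTL measure can be computed, consider the sketch below. It rests on our own assumption that "mostly taught before this year" and "mostly taught this year" both count as covered, which is one plausible coding rather than the exact TIMSS scaling procedure:

```python
# Teacher responses per topic: "before" = mostly taught before this year,
# "this_year" = mostly taught this year, "not_yet" = not yet taught / just introduced.
responses = {
    "Number":          ["before", "this_year", "this_year", "not_yet", "before"],
    "Algebra":         ["this_year", "this_year", "not_yet", "not_yet", "this_year"],
    "Geometry":        ["before", "not_yet", "this_year", "this_year", "not_yet", "not_yet"],
    "Data and Chance": ["this_year", "not_yet", "not_yet"],
}

COVERED = {"before", "this_year"}

# Domain index: percentage of topics covered, on a 0-100 scale.
domain_otl = {
    domain: 100 * sum(r in COVERED for r in topics) / len(topics)
    for domain, topics in responses.items()
}

# General OTL measure: the average of the four domain indices.
general_otl = sum(domain_otl.values()) / len(domain_otl)
print(domain_otl, round(general_otl, 1))
```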


Table 1 Differences and similarities in OTL measures in TIMSS and PISA

| | TIMSS | PISA |
| Scope | Focus on mathematical content; cognitive operations not addressed | Focus on mathematical content; cognitive operations not addressed |
| Curricular unit | Mathematical content categories | Specific test items* |
| Source | Teachers | Students |
| Exposure/alignment | Focus on exposure (content covered yes/no) | Focus on exposure (frequency) |

* Content categories are addressed in the PISA student questionnaire as well, but the current chapter focuses on the OTL measure that relates to specific test items.

Assessment/Analytical Frameworks
The framework that guided the development of the mathematics test for grade 8 students in TIMSS 2011 (referred to as the assessment framework) includes the following four main content domains:

• Number
• Algebra
• Geometry
• Data and Chance

In addition, the following three cognitive domains are discerned (Mullis et al., 2009):

• Knowing
• Applying
• Reasoning

The four content domains relate to specific mathematical knowledge and skills. Each content domain includes several more specific topic areas. For example, one of the topics in the Number domain is Fractions and Decimals. Each topic is described in terms of specific skills and knowledge (e.g., convert between fractions and decimals). Table 2 presents an overview of the four content domains and the topic areas included in each domain. In addition to the content domains, the framework also includes three cognitive domains that describe the cognitive skills that students need to use when they do the mathematics tests. The first cognitive domain (knowing) relates to a student's ability to recall knowledge, recognize shapes and expressions, retrieve information from graphs, and classify objects by attributes. The second cognitive domain (applying) involves the application of mathematical knowledge and procedures to problems that are fairly routine in nature. The reasoning domain involves the ability to find solutions to nonroutine problems by means of mathematical reasoning.


Table 2 Coverage of topics in the TIMSS teacher questionnaire compared to the mathematics framework

Coverage of topics addressed in the TIMSS teacher questionnaire:

A. Number
  a) Computing, estimating, or approximating with whole numbers
  b) Concepts of fractions and computing with fractions
  c) Concepts of decimals and computing with decimals
  d) Representing, comparing, ordering, and computing with integers
  e) Problem solving involving percent and proportion
B. Algebra
  a) Numeric, algebraic, and geometric patterns or sequences (extension, missing terms, generalization of patterns)
  b) Simplifying and evaluating algebraic expressions
  c) Simple linear equations and inequalities
  d) Simultaneous (two variables) equations
  e) Representation of functions as ordered pairs, tables, graphs, words, or equations
C. Geometry
  a) Geometric properties of angle and geometric shapes (triangles, quadrilaterals, and other common polygons)
  b) Congruent figures and similar triangles
  c) Relationship between three-dimensional shapes and their two-dimensional representations
  d) Using appropriate measurement formulas for perimeters, circumferences, areas, surface areas, and volumes
  e) Points on the Cartesian plane
  f) Translation, reflection, and rotation
D. Data and Chance
  a) Reading and displaying data using tables, pictographs, bar graphs, pie charts, and line graphs
  b) Interpreting data sets (e.g., draw conclusions, make predictions, and estimate values between and beyond given data points)
  c) Judging, predicting, and determining the chances of possible outcomes

Content domains and topic areas in the mathematics framework:

Number: Whole numbers; Fractions and decimals; Integers; Ratio, proportion, and percent
Algebra: Patterns; Algebraic expressions; Equations/formulas and functions
Geometry: Geometric shapes; Geometric measurement; Location and movement
Data and Chance: Data organization and representation; Data interpretation; Chance

Like the TIMSS 2011 framework, the PISA 2012 framework (referred to as analytical framework) also includes four broad content categories that have served as a guide in the development of test items (OECD, 2013). These content categories match the four categories in the TIMSS framework quite closely and are labeled as follows:

• Quantity
• Change and Relationships
• Space and Shape
• Uncertainty and Data

Whereas in the TIMSS framework three cognitive domains are discerned, the PISA framework distinguishes three types of mathematical process. These processes describe how students connect the context of a problem with mathematical concepts. The three processes are described as follows:

• Formulating situations mathematically
• Employing mathematical concepts, facts, procedures, and reasoning
• Interpreting, applying, and evaluating mathematical outcomes

Also in this regard, a substantial degree of correspondence can be observed between the cognitive domains in the TIMSS framework and the mathematical processes in the PISA framework. Unlike the TIMSS framework, the PISA framework mentions the context to which the test items relate. In this respect four types are discerned:

• Personal
• Occupational
• Societal
• Scientific

Whereas the TIMSS framework suggests that assignment of test items to a particular mathematical content domain is fairly straightforward, the discussion of the PISA framework emphasizes that mathematical problems in realistic contexts will often relate to more than a single content category. In the TIMSS framework, the four main content domains are hierarchically subdivided into more narrowly defined topic areas, which in turn describe specific knowledge and skills. Developing concrete test items is then the next logical step. The PISA framework lists a considerable number of content topics (see Table 3), but at the same time it is explicitly stated that there is no one-to-one mapping of the content topics to the broad content categories. Each content topic listed is assumed to relate to more than just one of the broad content categories. In the PISA framework, the four broad categories mainly serve as a guideline to ensure an appropriate variation of test items with regard to mathematical content.

Table 3 Content topics listed in the PISA mathematics framework (OECD, 2013, p. 36)

▪ Functions: The concept of function, emphasizing but not limited to linear functions, their properties, and a variety of descriptions and representations of them. Commonly used representations are verbal, symbolic, tabular, and graphical
▪ Algebraic expressions: Verbal interpretation of and manipulation with algebraic expressions, involving numbers, symbols, arithmetic operations, powers, and simple roots
▪ Equations and inequalities: Linear and related equations and inequalities, simple second-degree equations, and analytic and nonanalytic solution methods
▪ Co-ordinate systems: Representation and description of data, position, and relationships
▪ Relationships within and among geometrical objects in two and three dimensions: Static relationships such as algebraic connections among elements of figures (e.g., the Pythagorean Theorem as defining the relationship between the lengths of the sides of a right triangle), relative position, similarity and congruence, and dynamic relationships involving transformation and motion of objects, as well as correspondences between two- and three-dimensional objects
▪ Measurement: Quantification of features of and among shapes and objects, such as angle measures, distance, length, perimeter, circumference, area, and volume
▪ Numbers and units: Concepts, representations of numbers and number systems, including properties of integer and rational numbers, relevant aspects of irrational numbers, as well as quantities and units referring to phenomena such as time, money, weight, temperature, distance, area and volume, and derived quantities, and their numerical description
▪ Arithmetic operations: The nature and properties of these operations and related notational conventions
▪ Percents, ratios, and proportions: Numerical description of relative magnitude and the application of proportions and proportional reasoning to solve problems
▪ Counting principles: Simple combinations and permutations
▪ Estimation: Purpose-driven approximation of quantities and numerical expressions, including significant digits and rounding
▪ Data collection, representation, and interpretation: Nature, genesis, and collection of various types of data, and the different ways to represent and interpret them
▪ Data variability and its description: Concepts such as variability, distribution, and central tendency of data sets, and ways to describe and interpret these in quantitative terms
▪ Samples and sampling: Concepts of sampling and sampling from data populations, including simple inferences based on properties of samples
▪ Chance and probability: Notion of random events, random variation and its representation, chance and frequency of events, and basic aspects of the concept of probability

Similarities Between the Assessment/Analytical Frameworks and the OTL Measures
The frameworks that guide the development of the mathematics tests in TIMSS and PISA are largely similar (see Table 4 for a schematic comparison). Especially the four broad content domains/categories in both surveys hardly differ from each other. Also the distinctions between three cognitive domains/mathematical processes bear considerable resemblance. The main difference relates to the hierarchical nature of the frameworks. The TIMSS framework is unmistakably more hierarchical in character, as it implies a straightforward chain from general content domains, via more detailed topic areas that describe specific knowledge and skills, to concrete test items. In contrast, the PISA framework explicitly rejects a strictly hierarchical structure. The coherence of mathematical knowledge is emphasized, and it is hardly considered feasible to formulate items that relate exclusively to a single content area. Consequently, a one-to-one mapping of content topics to the broad content categories is not deemed realistic.


Table 4 Comparison of TIMSS and PISA frameworks

| TIMSS assessment framework | PISA analytical framework |
| Content domains | Content categories |
| Number | Quantity |
| Algebra | Change and Relationships |
| Geometry | Space and Shape |
| Data and Chance | Uncertainty and Data |
| Cognitive domains | Mathematical processes |
| Knowing | Formulating situations mathematically |
| Applying | Employing mathematical concepts, facts, procedures, and reasoning |
| Reasoning | Interpreting, applying, and evaluating mathematical outcomes |

Key Findings: Interpretation of the Content Comparison with an Eye to Content Validity and Concurrent Validity
The OTL measurements in PISA and TIMSS clearly differ in some key respects. The most obvious divergence relates to the data source. Whereas TIMSS relies on teacher reports, information on OTL is provided by the students in PISA. In addition, there is a difference in the curricular unit on which the OTL measure is based. In TIMSS the information relates to specific mathematics topics. In PISA the students indicate how frequently they have encountered three concrete mathematics tasks during their time at school.

The topics covered in the TIMSS OTL measure largely coincide with the topics listed in the TIMSS 2011 Assessment Frameworks (Mullis et al., 2009). Table 2 provides a comparison of the topics listed in the TIMSS assessment framework and the topics addressed in the teacher questionnaire on content covered. Although both lists are not exactly identical, the high degree of correspondence between the assessment framework and the teacher questionnaire is unmistakable. It can safely be concluded that the OTL measure in TIMSS 2011 reflects the content of the mathematics test more completely than is the case in PISA 2012. The TIMSS study design is based on a straightforward mapping from the topics in the assessment framework to the topics covered in the OTL measure. As a result, the TIMSS OTL measure presents a careful coverage of all relevant topics covered in the test. This seems much more dubious in PISA, in which the OTL measure covers a rather restricted subset of all topics covered in the Analytical Framework (OECD, 2013). The OTL measure in TIMSS is based on 19 questionnaire items, whereas the measure in PISA relates to only 3. Moreover, the items that make up the PISA OTL measure with regard to formal mathematics all relate to solving algebraic equations. Finally, it should be noted that the PISA OTL measure is based on student reports and, strictly speaking, only expresses the students' recollections of exposure to certain mathematical tasks. It seems likely that such recollections do not exclusively reflect actual exposure, but that they are also influenced by other factors (e.g., quality of instruction, learning aptitudes, and work attitudes). On the other hand, it should also be acknowledged that teacher reports, on which the OTL measure in TIMSS is
based, may suffer from bias and/or lack of reliability. These issues will be discussed more extensively in the discussion section of this chapter.

Interpretation of the Comparison in Terms of Face Validity
Face validity assesses whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. The description and analysis of the TIMSS and PISA OTL measures have addressed the way test items are logically connected with higher-level and broader content categories included in the analytic frameworks of the two measures. This comparison approaches an assessment of the content validity of the two measures. In psychometrics, content validity (also known as logical validity) refers to the extent to which a measure represents all facets of a given construct. Both the TIMSS and the PISA OTL measures have explicit deductive schemes on the basis of which questionnaire items are ultimately selected. As described above, in both cases the logic is specified, but the frames are quite different from one another. Since the content comparison has been done in a relatively global way, it is better labeled a "content-related" facet of face validity than a formal study of content validity.

In addition, the comparison has focused on other features of the two measures, such as the scope of the domain covered, the number of items, and the respondents. A comparison of this kind approaches the aims of concurrent validity. Concurrent validity is demonstrated when a test correlates well with a measure that has previously been validated. The two measures may relate to the same construct, but are more often used for different, but presumably related, constructs. But here too, the comparison does not qualify as a full-fledged study of concurrent validity, because the analysis is informal and non-quantitative, and, moreover, it is not so clear which of the two measures should be considered as "previously validated," although the TIMSS OTL measures have a longer history. Therefore, the descriptive comparison of the two measures is, once more, to be seen as a differently oriented facet of face validity, in this case a "concurrent validity inspired" one. In the next section the quantitative analysis will partially address concurrent validity, by examining the correlation between the two measures at the country level.

Convergent Validity: The Association of OTL with Student Performance in TIMSS and PISA

Data Analysis
Before reporting the main findings, descriptive statistics and correlations of country means are presented. Table 5 presents TIMSS and PISA information per country on the number of schools, average student achievement, and the average score on the questionnaire item on number of books at home (with categories ranging from 1 to 5). In total 25 countries are included. These are the countries that participated in both PISA 2012 and TIMSS 2011 with a grade 8 population. There are two exceptions: Kazakhstan and Russia are not included, as in these countries no data on OTL were collected. The number of schools per country included in the surveys is in most cases a little higher in PISA (with the exception of Malaysia, Tunisia, Turkey, and the United States). Usually the numbers of schools in PISA and TIMSS are not radically different, although in three cases the number in PISA is more than twice as large as the number of schools in TIMSS (Australia, Italy, and Slovenia). In one case the opposite applies: the number of American schools in TIMSS is twice as large as it is in PISA. The appendix provides details on the number of mathematics teachers per school involved in TIMSS and on the number of students per school in PISA. With regard to TIMSS, the appendix shows that in most cases fewer than two teachers per school provided information on OTL (the exceptions are Australia, Israel, Korea, New Zealand, Singapore, Slovenia, and Sweden). In Malaysia the number of teachers per school is exactly one, and in six countries the average number of teachers per school is below 1.10 (Chile, Italy, Thailand, Tunisia, Turkey, and Taiwan). In most countries the information on OTL per school is thus usually based on the report of only one teacher. The median across countries is 1.46 teachers per school (the average is 1.75). The number of students per school in PISA ranges from 17.5 in Slovenia to 69.8 in Qatar. The median across countries is 28.5 students per school (the average is 29.2).

Some of the countries selected for this study score very high on mathematics in both TIMSS and PISA (e.g., Singapore and Korea), but some countries that score far below the international mean in both surveys (e.g., Tunisia and Qatar) are included as well. Table 5 shows a great deal of consistency between TIMSS and PISA. Countries with high average achievement in one survey tend to show high scores in the other as well. With regard to the number of books at home, a similar consistency between surveys can be observed. Korea and Norway show high averages in both surveys, whereas the opposite goes for Tunisia and Thailand. In these two countries the average score on a scale from 1 to 5 hardly exceeds 2, which suggests a little over 11 books at home on average. In Korea and Norway, the average is well above three in both surveys, which suggests over 25 books at home, and probably closer to 100. With regard to OTL the country averages are much less consistent. Figures 1, 2, and 3 provide graphic displays (scatterplots) of the correlations between country means in PISA and TIMSS. The correlation is extremely high (.953) for mathematics achievement. For number of books at home the correlation is a little lower (.893), but still very high. This strongly suggests that, at the country level, both surveys measure largely the same thing with regard to student achievement and number of books at home. However, with regard to OTL the picture is radically different. The correlation between PISA and TIMSS is much weaker (.303) and not statistically significant at α = .05. Considering these findings, it hardly seems justified to assume that both OTL measures relate to a common underlying concept.

Table 5 Basic statistics per country

| Country | Schools (TIMSS) | Schools (PISA) | Math score (TIMSS) | Math score (PISA) | Books at home 1–5 (TIMSS) | Books at home 1–5 (PISA) | OTL score (TIMSS) | OTL score (PISA) |
| Australia | 227 | 775 | 505 | 504 | 3.27 | 3.40 | 78.9 | 0.165 |
| Chile | 184 | 221 | 416 | 423 | 2.51 | 2.42 | 71.5 | 0.102 |
| Finland | 142 | 311 | 514 | 519 | 3.29 | 3.33 | 55.6 | 0.003 |
| Hong Kong | 115 | 148 | 586 | 561 | 2.69 | 2.79 | 81.7 | 0.149 |
| Hungary | 145 | 204 | 505 | 477 | 3.21 | 3.46 | 85.9 | 0.140 |
| Indonesia | 153 | 209 | 386 | 375 | 2.70 | 2.30 | 63.5 | 0.153 |
| Israel | 151 | 172 | 516 | 466 | 3.12 | 3.29 | 88.4 | 0.011 |
| Italy | 188 | 1194 | 498 | 485 | 3.03 | 3.12 | 80.7 | 0.219 |
| Japan | 138 | 191 | 570 | 536 | 2.93 | 3.35 | 89.6 | 0.193 |
| Korea (South) | 148 | 156 | 613 | 554 | 3.61 | 3.80 | 90.9 | 0.428 |
| Lithuania | 141 | 216 | 502 | 479 | 2.78 | 2.89 | 70.2 | 0.133 |
| Malaysia | 180 | 164 | 440 | 421 | 2.19 | 2.84 | 81.8 | 0.021 |
| Norway | 132 | 197 | 475 | 489 | 3.36 | 3.48 | 51.9 | 0.005 |
| New Zealand | 154 | 177 | 488 | 500 | 3.14 | 3.33 | 78.2 | 0.270 |
| Qatar | 107 | 157 | 410 | 376 | 2.75 | 2.81 | 86.2 | 0.282 |
| Romania | 146 | 178 | 458 | 445 | 2.50 | 2.70 | 93.8 | 0.067 |
| Singapore | 165 | 172 | 611 | 573 | 2.82 | 3.05 | 87.8 | 0.331 |
| Slovenia | 181 | 338 | 505 | 501 | 2.91 | 2.92 | 67.4 | 0.199 |
| Sweden | 142 | 209 | 484 | 478 | 3.23 | 3.38 | 60.0 | 0.251 |
| Thailand | 171 | 239 | 427 | 427 | 2.08 | 2.37 | 76.0 | 0.090 |
| Tunisia | 205 | 153 | 425 | 388 | 2.17 | 2.00 | 67.4 | 0.302 |
| Turkey | 238 | 170 | 452 | 448 | 2.48 | 2.45 | 94.7 | 0.104 |
| Taiwan | 150 | 163 | 609 | 560 | 3.00 | 3.13 | 71.2 | 0.040 |
| United Arab Emirates | 430 | 458 | 456 | 434 | 2.64 | 2.72 | 78.9 | 0.097 |
| United States of America | 384 | 162 | 509 | 481 | 2.94 | 2.83 | 90.6 | 0.093 |
| Average across countries | | | 494 | 476 | 2.85 | 2.97 | 77.7 | 0.002 |

Fig. 1 Correlation between country means on mathematics achievement in PISA and TIMSS (PISA 2012 country means plotted against TIMSS 2011 country means; r = .953)
Fig. 2 Correlation between country means on number of books at home in PISA and TIMSS (PISA 2012 country means plotted against TIMSS 2011 country means; r = .893)
Fig. 3 Correlation between country means on OTL in PISA and TIMSS (PISA 2012 country means plotted against TIMSS 2011 country means; r = .303)
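The country-level correlations reported in Figs. 1–3 are plain Pearson correlations over the 25 country means. A minimal sketch (with short placeholder arrays rather than the full columns of Table 5) is:

```python
import numpy as np

# Placeholder country means; the actual analysis uses the 25 pairs from Table 5.
timss_math = np.array([505, 416, 514, 586, 505])
pisa_math = np.array([504, 423, 519, 561, 477])

r = np.corrcoef(timss_math, pisa_math)[0, 1]
print(f"country-level correlation: {r:.3f}")
```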

Key Findings with Regard to Convergent Validity Table 6 shows the key findings on the association between OTL and student performance in PISA and TIMSS. Per country a regression analysis has been

242

H. Luyten and J. Scheerens 575

r = .953

PISA 2012

550 525 500 475 450 425 400 400

425

450

475

500

525

550

575

600

625

TIMSS 2011

PISA 2012

Fig. 1 Correlation between country means on mathematics achievement in PISA and TIMSS 3.8 3.6 3.4 3.2 3.0 2.8 2.6 2.4 2.2 2.0

r = .893

2.0

2.2

2.4

2.6

2.8

3.0

3.2

3.4

3.6

3.8

TIMSS 2011

PISA 2012

Fig. 2 Correlation between country means on number of books at home in PISA and TIMSS 0.5 0.4 0.3 0.2 0.1 0.0 -0.1 -0.2 -0.3 -0.4

r = .303

50

55

60

65

70

75

80

85

90

95

100

TIMSS 2011

Fig. 3 Correlation between country means on OTL in PISA and TIMSS

Table 6 Regression analyses: standardized regression coefficients of books at home and OTL (mathematics school means regressed on books at home and OTL); standard errors in parentheses

Country | OTL, TIMSS | OTL, PISA | Books at home, TIMSS | Books at home, PISA
Australia | 0.150*** (0.040) | 0.368*** (0.026) | 0.720*** (0.040) | 0.495*** (0.026)
Chile | -0.034 (0.047) | 0.424*** (0.039) | 0.769*** (0.047) | 0.555*** (0.039)
Finland | -0.066 (0.076) | 0.319*** (0.048) | 0.464*** (0.076) | 0.383*** (0.047)
Hong Kong | -0.064 (0.064) | 0.546*** (0.048) | 0.703*** (0.064) | 0.452*** (0.048)
Hungary | 0.055 (0.048) | 0.196*** (0.044) | 0.837*** (0.048) | 0.743*** (0.044)
Indonesia | 0.142* (0.074) | 0.302*** (0.062) | 0.336*** (0.074) | 0.393*** (0.062)
Israel | 0.008 (0.073) | 0.512*** (0.054) | 0.488*** (0.073) | 0.382*** (0.054)
Italy | 0.107* (0.063) | 0.378*** (0.020) | 0.519*** (0.063) | 0.531*** (0.021)
Japan | -0.019 (0.071) | 0.561*** (0.044) | 0.570*** (0.071) | 0.428*** (0.044)
Korea (South) | 0.032 (0.053) | 0.407*** (0.051) | 0.748*** (0.053) | 0.540*** (0.051)
Lithuania | -0.117* (0.058) | 0.368*** (0.046) | 0.693*** (0.058) | 0.567*** (0.046)
Malaysia | -0.040 (0.045) | 0.530*** (0.056) | 0.790*** (0.045) | 0.364*** (0.057)
Norway | 0.117* (0.070) | 0.322*** (0.063) | 0.604*** (0.070) | 0.390*** (0.063)
New Zealand | 0.232*** (0.052) | 0.368*** (0.044) | 0.697*** (0.052) | 0.617*** (0.044)
Qatar | -0.343*** (0.064) | 0.813*** (0.042) | 0.730*** (0.064) | 0.134*** (0.042)
Romania | 0.041 (0.057) | 0.119* (0.059) | 0.722*** (0.056) | 0.699*** (0.059)
Singapore | 0.059 (0.048) | 0.238*** (0.051) | 0.769*** (0.048) | 0.673*** (0.051)
Slovenia | -0.026 (0.063) | 0.219*** (0.038) | 0.563*** (0.063) | 0.677*** (0.038)
Sweden | 0.068 (0.055) | 0.204*** (0.057) | 0.720*** (0.055) | 0.539*** (0.057)
Thailand | 0.059 (0.057) | 0.301*** (0.058) | 0.674*** (0.057) | 0.461*** (0.058)
Tunisia | -0.048 (0.049) | 0.587*** (0.054) | 0.695*** (0.049) | 0.360*** (0.054)
Turkey | 0.098* (0.043) | 0.396*** (0.059) | 0.716*** (0.043) | 0.516*** (0.059)
Taiwan | 0.079* (0.047) | 0.191*** (0.049) | 0.803*** (0.046) | 0.737*** (0.049)
United Arab Emirates | -0.025 (0.040) | 0.508*** (0.033) | 0.499*** (0.040) | 0.410*** (0.087)
United States of America | 0.118** (0.031) | 0.291*** (0.044) | 0.721*** (0.031) | 0.689*** (0.044)
Average across countries | 0.023 | 0.379 | 0.662 | 0.509

* significant at α = .05 (one-tailed); ** significant at α = .01 (one-tailed); *** significant at α = .001 (one-tailed)

The analyses show large differences between TIMSS and PISA, but they confirm the previously reported findings on the relation between mathematics achievement and exposure to formal mathematics in PISA 2012 (OECD, 2014a, b; Schmidt et al., 2015). In every one of the 25 countries listed in Table 6, the OTL regression coefficients in PISA are positive and statistically significant; in 24 countries they are significant at the .001 level (one-tailed). The standardized OTL coefficients in PISA range from .119 (Romania) to .813 (Qatar). The average value across countries is .379, which must be considered a large effect, as it corresponds to an effect size (Cohen's d) of approximately .80 (Cohen, 1988).

In contrast, the OTL coefficients in TIMSS are much smaller. The average across countries amounts to a modest .023, which corresponds to an effect size below .05 (Cohen, 1988) and must be considered very small by any standard. It is also much smaller than the average OTL effect of about .30 reported in the meta-analyses mentioned in the introductory section of this chapter; an effect size of .30 (in terms of Cohen's d) corresponds to a correlation or standardized regression coefficient of about .15. A considerable portion of the TIMSS regression coefficients (10 out of 25) is actually negative, which would imply that mathematics achievement decreases as the amount of content covered increases, although only two of the negative coefficients are statistically significant (at the .05 level, one-tailed); these relate to Qatar and Lithuania. In 15 countries the coefficient is positive, and in 8 of these it is statistically significant (at the .05 level, one-tailed). This amounts to 32% positive and significant effects of OTL on achievement in TIMSS, somewhat below the 44% obtained in the recent vote-count analysis by Lamain et al. (2017).

The analyses also show consistently strong effects of number of books at home on student achievement. The effects of this variable are significant (at α < .001, one-tailed) and positive in every country, in both PISA and TIMSS. These findings indicate that OTL in mathematics tends to be relatively high in schools where the average number of books at home is high as well. In both PISA and TIMSS, the coefficients for number of books at home are larger than the OTL coefficients.
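For reference, the effect-size conversions used above follow from the standard relations between Cohen's d and a correlation r (Cohen, 1988), treating the standardized regression coefficient as approximately a correlation; a brief sketch of the arithmetic under that approximation:

```latex
d = \frac{2r}{\sqrt{1 - r^{2}}}, \qquad r = \frac{d}{\sqrt{d^{2} + 4}}

% r = .379  =>  d = 2(.379)/\sqrt{1 - .379^{2}} \approx .82  (a large effect)
% r = .023  =>  d \approx .046                               (a negligible effect)
% d = .30   =>  r = .30/\sqrt{.30^{2} + 4} \approx .15
```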

Discussion

The following conclusions clearly stand out from the findings reported in the present chapter:

• The content validity of the OTL measure that has been developed in TIMSS seems high, given the close correspondence between the mathematics content included in the TIMSS assessment framework and the OTL measure.

• The concurrent validity looks questionable, given the important differences between the measures developed in TIMSS and PISA regarding the scope of the domain covered, the number of items, and the respondents (teachers in TIMSS and students in PISA). The actual correlation between the two measures was as low as .303, which confirms the conclusion of questionable concurrent validity.

• The statistical association between OTL and student achievement is disappointingly weak in TIMSS. This suggests a lack of convergent validity.

• Although the convergent validity of the OTL measure in PISA seems high (given its remarkably strong and consistent statistical association with mathematics achievement in 25 different countries), the content validity of this measure remains open to question. In comparison to TIMSS, OTL in PISA covers no more than a highly restricted subset of all the mathematical content that is covered in the mathematics test.

It should be mentioned that, in addition to the OTL measure in PISA that the current chapter has focused on, other OTL measures have been constructed on the basis of the PISA data. These relate to experience with applied mathematics, word problems, and familiarity with a number of mathematical concepts. However, prior analyses have indicated that the measure focusing exclusively on the coverage of formal mathematics clearly shows the strongest association with mathematics achievement (OECD, 2014a, b; Schmidt et al., 2015). The current study unmistakably confirmed the strong relation between exposure to formal mathematics (as reported by the students) and mathematics achievement. This is a striking result, as the OTL measure on formal mathematics is actually based on exposure to no more than three mathematical tasks that exclusively relate to solving algebraic equations.

In addition to doubts about the content validity, some other issues with regard to the OTL measure in PISA should be mentioned. It seems possible that the student responses do not only reflect the amount of content covered in the mathematics classes, but a number of other factors as well. It is important to note that the PISA OTL measure is based on student reports on how frequently they had encountered a number of specific mathematics tasks. Strictly speaking, the measure expresses the students' recollections of exposure to certain mathematical tasks. It can be argued that such recollections do not just reflect actual exposure but are also influenced by the quality of instruction: it is conceivable that students have been exposed to some mathematical content but hardly recall any of it because the quality of instruction was poor. Recollection may also be determined by the students' learning aptitudes, their prior knowledge, or their work attitudes, and these factors may moderate recollections even if the quality of instruction was high. Slow learners are less likely to remember what was covered in class than fast learners. Instruction will hardly be effective if students lack sufficient prior knowledge, and in that case they are less likely to recall what kind of tasks they have encountered. Finally, it seems doubtful whether students will remember what has been taught if they were hardly motivated to learn (Carroll, 1963, 1989). All in all, there are many reasons for reservations about the validity of the OTL measure used in PISA. On the other hand, it must be emphasized that these are factors that might affect the OTL measure: it is possible that, in addition to actual exposure, recollected exposure also reflects quality of instruction, learning aptitudes, prior knowledge, and work attitudes, but further research should reveal to what extent such concerns are actually warranted.


It should also be noted that the OTL measure in PISA may be less prone to random error (in other words, it may be more reliable), because it is based on student reports. OTL in TIMSS is typically based on the reports of one or two mathematics teachers per school (see the appendix for details), whereas the OTL measure in PISA is based on about 29 students per school (i.e., the mean/median across countries; see the appendix). This suggests that the OTL measures are considerably more reliable in PISA than in TIMSS. Low reliability of measurement decreases the chances of detecting a relationship between the variables involved; in other words, even if a relationship exists in the real world, the chances of detecting it in empirical research become small. In addition to concerns about the reliability of the OTL measure in TIMSS, it should also be noted that teacher responses may be biased to some extent. It seems likely that at least some teachers will provide socially desirable responses and report what they are supposed to have taught rather than what they actually taught. Moreover, teachers may not always be able to provide accurate information about the content that has been taught to their students in prior grades.

These considerations call for a closer examination of the validity of the OTL measures in both TIMSS and PISA. The best way to settle the unresolved issues with respect to the content, concurrent, and convergent validity of OTL measures would be to assess the correlation of the student and teacher reports with more objective measures of OTL. OTL measures that are more objective than the ones used in TIMSS and PISA are definitely feasible, but they do not seem suitable for use in large-scale surveys. More precise and objective information on content covered could be obtained by means of classroom observations. Keeping detailed logs (by teachers, students, or both) might be a useful alternative as well. Comparing teacher and student reports on OTL with either observations or detailed logs would probably shed more light on the validity of student and teacher reports. It should be noted, though, that such a study would require careful sampling of observations across the school year, as it does not seem feasible to observe all lessons during an entire school year. Neither does it seem realistic to expect that teachers and (especially) students will keep accurate and detailed logs for a prolonged period. On the other hand, computer-based learning opens up new possibilities for analyzing log files that provide details on the time a student spent on specific tasks (e.g., Faber et al., 2017).

It would also be informative to compare OTL measures of both teachers and students that relate to the same lessons. This would reveal to what extent students and teachers agree on the content covered. Such a study design could, in principle, be applied in large-scale studies like TIMSS and PISA. In that case, it would probably be appropriate to focus on concrete test items rather than abstract mathematics topics (like "geometric measurement" or "algebraic expressions"). Research on agreement among students within schools or classes would be highly informative as well. If it is reasonable to assume that the content covered does not vary between students within classes, lack of agreement between students would suggest a lack of measurement reliability. An assessment of the agreement between students within schools would be possible by means of a secondary analysis of PISA data. In that case, it should be taken into account that lack of agreement within schools may also result from genuine variation within schools (e.g., between classes or grades).


An analysis of agreement on OTL between teachers within schools could be conducted on TIMSS data, at least in schools where two or more teachers have provided reports on content covered.

Whereas the relation of OTL with student achievement in TIMSS is surprisingly weak in comparison to findings from previous meta-analyses (Scheerens, 2016; Lamain, Scheerens, & Noort, 2017), the opposite holds for the OTL measure in PISA. The correlation of this measure with student achievement seems almost "too good to be true." This raises the question whether the observed relation between OTL and achievement in PISA is somehow an overestimation of the real relation. It certainly seems possible that the student reports of content covered are confounded with other factors that are likely to have an impact on student achievement. It would definitely be feasible to assess the impact of some of the possibly confounding variables (especially student attitudes and quality of instruction) in further analyses of the PISA data. Such analyses should detect whether the relation between OTL and achievement weakens when student attitudes and quality of instruction are included as covariates. It would also be worthwhile to check for interaction effects of OTL with these control variables, as it seems plausible that variables like student attitudes and quality of instruction can moderate the effect of OTL. One might even attempt to test the Carroll model (Carroll, 1963, 1989), which posits five factors to account for variation in school learning (prior achievement, learning aptitudes, quality of instruction, perseverance, and OTL), in strictly controlled laboratory experiments, for example, in a setting in which respondents need to learn new content in a short time span (a few hours). In such settings, one can more easily manipulate aspects like OTL (content covered) and quality of instruction. Moreover, monitoring student perseverance would be relatively straightforward (e.g., through observation or computer logs). Even prior knowledge may be open to manipulation.
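Before turning to the conclusions, the reliability argument made above (one or two teacher reports versus roughly 29 student reports per school) can be made concrete with two standard psychometric relations; the numbers below are purely illustrative assumptions, not estimates obtained from the TIMSS or PISA data.

```latex
\rho_{k} = \frac{k\,\rho_{1}}{1 + (k - 1)\,\rho_{1}} \quad\text{(Spearman--Brown)},
\qquad
r_{\text{observed}} \approx r_{\text{true}}\,\sqrt{\rho_{xx}\,\rho_{yy}} \quad\text{(attenuation)}

% Illustration: if a single report had reliability \rho_{1} = .30, then an aggregate of
% k = 2 teacher reports would have reliability \rho_{2} \approx .46, whereas an aggregate
% of k = 29 student reports would reach \rho_{29} \approx .93 -- so the same true
% OTL-achievement relation would appear markedly weaker with the less reliable measure.
```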

Conclusion

The current chapter has come up with several intriguing results that call for closer examination. First, given the differences between PISA and TIMSS, it looks as though student reports of content covered are more closely related to student achievement than teacher reports. Another suggestion that arises is that a measure of exposure to specific content (e.g., formal mathematics) may show a stronger relation with student achievement than a measure that captures whether all content covered in a test was taught in school; this would imply that coverage of certain key content is more closely related to achievement than exposure to each and every content domain covered in the test. Finally, it seems that an OTL measure based on specific tasks is more closely related to achievement than a measure that refers to more abstract concepts. However, these suppositions will not hold if further analyses show that student reports on OTL also reflect quality of instruction, student attitudes, aptitudes, and prior knowledge.

Finally, we should acknowledge that we have not attempted a broader "interpretative" assessment of the theory of action on OTL embedded in our alignment model. Yet this framework has exposed some relevant distinctions and seeming contradictions that warrant further study. The alignment model draws attention to different reference points for constructing OTL measures, which could be national standards, taxonomies of educational objectives, or assessed content. Next, the framework drew attention to the difference between alignment indices that depend solely on matching the content of documents representing curricular elements and assessments, on the one hand, and behavioral measures (exposure to intended content and actual achievement measures), on the other. Finally, "intended misalignment" presents an interesting issue for debate, not least because it would allow antagonists of educational testing to claim that unmeasurable skills and competencies are the most important.

References

Bruggencate, C. G., Pelgrum, W. J., & Plomp, T. (1986). First results of the second IEA science study in the Netherlands. In W. J. Nijhof & E. Warries (Eds.), Outcomes of education and training. Swets & Zeitlinger.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. SAGE. https://doi.org/10.4135/9781412985642
Carroll, J. B. (1963). A model of school learning. Teachers College Record, 64, 722–733.
Carroll, J. B. (1989). The Carroll model: A 25-year retrospective and prospective view. Educational Researcher, 18, 26–31.
Cohen, J. (1988). Statistical power analysis for the behavioural sciences (2nd ed.). Lawrence Erlbaum.
Faber, J. M., Luyten, H., & Visscher, A. J. (2017). The effects of a digital formative assessment tool on mathematics achievement and student motivation: Results of a randomized experiment. Computers & Education, 106, 83–96. https://doi.org/10.1016/j.compedu.2016.12.001
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6(4). https://doi.org/10.1177/001316444600600401
Hattie, J. (2009). Visible learning. Routledge.
Horn, A., & Walberg, H. J. (1984). Achievement and interest as a function of quantity and quality of instruction. Journal of Educational Research, 77, 227–237.
Husén, T. (1967). International study of achievement in mathematics: A comparison of twelve countries. Wiley.
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2006). Content-related validity evidence in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131–153). Lawrence Erlbaum Associates.
Kane, M. T. (2008). Terminology, emphasis, and utility in validation. Educational Researcher, 37(2), 76–82. https://doi.org/10.3102/0013189X08315390
Koretz, D. M., McCaffrey, D. F., & Hamilton, L. S. (2001). Toward a framework for validating gains under high-stakes conditions (CSE Technical Report 551). Center for the Study of Evaluation. http://cresst.org/wp-content/uploads/TR551.pdf
Kurz, A., Elliott, S. N., Wehby, J. H., & Smithson, J. L. (2010). Alignment of the intended, planned, and enacted curriculum in general and special education and its relation to student achievement. The Journal of Special Education, 44(3), 131–145.
Kyriakides, L., Christoforou, C., & Charalambous, C. Y. (2013). What matters for student learning outcomes: A meta-analysis of studies exploring factors of effective teaching. Teaching and Teacher Education, 36, 143–152.
Lamain, M., Scheerens, J., & Noort, P. (2017). Review and "vote-count" analysis of OTL-effect studies. In J. Scheerens (Ed.), Opportunity to learn, curriculum alignment and test preparation: A research review. Springer.
Luyten, H. (2017). Predictive power of OTL measures in TIMSS and PISA. In J. Scheerens (Ed.), Opportunity to learn, curriculum alignment and test preparation: A research review. Springer.
Marzano, R. J. (2003). What works in schools: Translating research into action. Association for Supervision and Curriculum Development.
Mehrens, W. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16(2), 16–18.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.
Mullis, I. V. S., Martin, M. O., Ruddock, G. J., O'Sullivan, C. Y., & Preuschoff, C. (2009). TIMSS 2011 assessment frameworks. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College. https://timss.bc.edu/timss2011/downloads/TIMSS2011_Frameworks.pdf
Mullis, I. V. S., Martin, M. O., Foy, P., & Arora, A. (2012). TIMSS 2011 international results in mathematics. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
OECD. (2009). PISA data analysis manual: SPSS (2nd ed.). OECD. https://doi.org/10.1787/9789264056275-en
OECD. (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial literacy. OECD Publishing. https://doi.org/10.1787/9789264190511-en
OECD. (2014a). PISA 2012 results: What students know and can do – Student performance in mathematics, reading and science (Volume I, revised edition, February 2014). OECD Publishing. https://doi.org/10.1787/9789262408780-en
OECD. (2014b). PISA 2012 technical report. OECD. https://www.oecd.org/pisa/pisaproducts/PISA-2012-technical-report-final.pdf
Pelgrum, W. J., Eggen, T. J. H. M., & Plomp, T. J. (1983). The second mathematics study: Results. Twente University.
Polikoff, M. S., & Porter, A. C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 20, 1–18. https://doi.org/10.3102/0162373714531851
Porter, A., McMaken, J., Hwang, J., & Yang, R. (2011). Common core standards: The new U.S. intended curriculum. Educational Researcher, 40(3), 103–116. http://journals.sagepub.com/doi/pdf/10.3102/0013189X11405038
Porter, A. C., & Smithson, J. L. (2001). Defining, developing, and using curriculum indicators. Consortium for Policy Research in Education, University of Pennsylvania Graduate School of Education. https://www.cpre.org/sites/default/files/researchreport/788_rr48.pdf
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3), 351–357.
Scheerens, J. (2016). Educational effectiveness and ineffectiveness: A critical review of the knowledge base. Springer. http://www.springer.com/gp/book/9789401774574
Scheerens, J. (Ed.). (2017). Opportunity to learn, curriculum alignment and test preparation: A research review. Springer. http://www.springer.com/gp/book/9783319431093
Scheerens, J., & Bosker, R. J. (1997). The foundations of educational effectiveness. Elsevier Science.
Scheerens, J., Luyten, H., Steen, R., & Luyten-de Thouars, Y. (2007). Review and meta-analyses of school and teaching effectiveness. University of Twente, Department of Educational Organisation and Management.
Schmidt, W. H. (2009). Exploring the relationship between content coverage and achievement: Unpacking the meaning of tracking in eighth grade mathematics. Michigan State University. http://education.msu.edu/epc/forms/Schmidt_2009_Relationship_between_Content_Coverage_and_Achievement.pdf
Schmidt, W. H., Burroughs, N. A., Zoido, P., & Houang, R. H. (2015). The role of schooling in perpetuating educational inequality: An international perspective. Educational Researcher, 20(10), 1–16. http://journals.sagepub.com/doi/pdf/10.3102/0013189X15603982
Schmidt, W. H., Cogan, L. S., Houang, R. T., & McKnight, C. C. (2011). Content coverage across countries/states: A persisting challenge for US educational policy. American Journal of Education, 117, 399–427.
Schmidt, W. H., McKnight, C. C., Houang, R. T., Wiley, D. E., Cogan, L. S., & Wolfe, R. G. (2001). Why schools matter: A cross-national comparison of curriculum and learning. Jossey-Bass.
Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis results. Review of Educational Research, 77(4), 454–499.
Wolming, S., & Wikström, C. (2010). The concept of validity in theory and practice. Assessment in Education: Principles, Policy & Practice, 17(2), 117–132. https://doi.org/10.1080/09695941003693856
Wu, M., & Adams, R. J. (2002, April 6–7). Plausible values – why they are important. Paper presented at the International Objective Measurement Workshop, New Orleans.

Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness

12

Leonidas Kyriakides, Charalambos Y. Charalambous, and Evi Charalambous

Contents
Introduction
A Brief History of International Large-Scale Assessment Studies in Education
ILSAs Conducted by the IEA
ILSAs Carried Out by the OECD Measuring Student Learning Outcomes
A Brief Historical Overview of the Educational Effectiveness Research
Connections Between ILSA and EER
The Dynamic Model of Educational Effectiveness: An Overview
Advancements of ILSAs by Making Use of the Dynamic Model
References

Abstract

This chapter argues that during the past 50 years, International Large-Scale Assessment studies (ILSAs) in education and Educational Effectiveness Research (EER) have contributed to each other in significant ways. It is also claimed that in the future, stronger links can be established between these domains which can help advance work in both, while at the same time better serve their common agenda in promoting quality and equity in education. To support these arguments, we trace the historical roots of both strands of work and point to ways in which each has contributed to the other. It is shown that ILSAs have mobilized the

resources for conducting educational effectiveness studies and afforded EER scholars the opportunity to capitalize on rich international datasets to conduct secondary analyses which have facilitated the theoretical and methodological development of EER. Additionally, by conducting both across- and within-country analyses, researchers have examined whether specific factors at the classroom and/or the school level are generic while others are more country-specific. On the other hand, EER has afforded theoretical constructs and frameworks to inform the design of ILSAs. For example, due to the influence of recent meta-analyses of effectiveness studies pointing to the importance of attending to teaching behaviors in the classroom, ILSAs are gradually making a step toward incorporating more such investigations. Forging stronger links between the two domains can advance work in each. We argue that from a theoretical perspective, ILSAs can be informed by the gradual shift currently witnessed in EER to move from a static to a more dynamic conceptualization of the process of teaching/schooling/learning. Specifically, the dynamic model of educational effectiveness could be a foundation upon which ILSAs could be based, both in terms of their design and in terms of secondary data analysis which can help better understand what contributes to student learning and through that develop reform policies to promote quality and equity in education.

This chapter builds on and extends a previous work of the first two authors: Kyriakides and Charalambous, Educational effectiveness research and international comparative studies: Looking back and looking forward. In R. Strietholt, W. Bos, J.-E. Gustafsson, & M. Rosén (Eds.), Educational policy evaluation through international comparative assessments (pp. 33–50). Waxmann, 2014.

L. Kyriakides (*) · C. Y. Charalambous · E. Charalambous
Department of Education, University of Cyprus, Nicosia, Cyprus
e-mail: [email protected]; [email protected]; [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_13

Keywords

International Large-Scale Assessment studies in education · Educational Effectiveness Research · Dynamics of education · Quality and equity in education

Introduction Aiming to start unpacking and understanding problems of school and student evaluation, in 1958, a group of educational psychologists, sociologists, and psychometricians met in Hamburg, Germany, forming a cross-national enterprise devoted to comparative studies of school practices and achievement, known as the International Association for the Evaluation of Educational Achievement (IEA) (Purves, 1987). This year can be considered to mark the inauguration of International Large-Scale Assessment (ILSA) studies in education. At about the same time, a study undertaken by Coleman et al. (1966) reported that only a very small proportion of the variation in student achievement can be attributed to schools; this finding sparked heated debates on the role of schools and teachers to student learning, thus giving birth to Educational Effectiveness Research (EER). A little more than half of a century after the inauguration of the ILSA and almost about 50 years after the issuing of the Coleman report which motivated the development of the EER, in this chapter we argue that ILSA in education and EER have contributed to each other in significant ways. It is also claimed that in the future, stronger links should be established between these two fields of educational research which can help advance work in both, while at the same time contribute to the promotion of quality and equity in


education. To support these arguments, we trace the historical roots of both strands of work and point to ways in which each has contributed to the other. We also argue that from a theoretical perspective, ILSA can be informed by the gradual shift currently witnessed in EER to move from a static to a more dynamic conceptualization of the process of teaching and schooling. Specifically, we claim that the dynamic model of educational effectiveness (Creemers & Kyriakides, 2008) could be a foundation upon which ILSAs in education could be based, both in terms of their design and in terms of secondary data analysis which can help better understand what contributes to student learning and through that develop reform policies to promote quality and equity in education. This chapter is, therefore, organized in five sections. In the first two sections, we provide a brief historical overview of ILSA and EER, correspondingly. Our aim in this overview is not to provide a comprehensive review of the historical development of either domain but rather to identify areas of convergence. In the third section, we point to areas in which each domain has contributed to the other and argue for the importance of considering the dynamic nature of education. As a consequence, in the last two sections of the chapter, we present the main assumptions and factors of the dynamic model of educational effectiveness and draw suggestions for establishing stronger links between ILSA and EER in order to promote the design of such studies that will be able to explain which factors can have a significant effect on student learning outcomes and consequently be able to propose the development of such educational policies that will promote quality and equity in education.

A Brief History of International Large-Scale Assessment Studies in Education

Although in this chapter we largely think of ILSAs in education as a unified body, in essence, these studies display notable differences. However, in this section, we consider two broad strands of ILSAs in education: those conducted by the IEA and those carried out by the Organization for Economic Co-operation and Development (OECD), which are concerned with different types of student learning outcomes. We note, however, that several other ILSAs in education have also been conducted during these past 60 years (see Reynolds, 2006; Panayiotou et al., 2014). Providing a comprehensive overview of even one of the aforementioned strands of ILSA lies beyond the scope of this chapter; we rather concentrate on selected studies to illustrate some trends and critically discuss design decisions associated with these studies.

ILSAs Conducted by the IEA

With respect to the IEA studies, the first ILSA study took place in 1960 and focused on five subjects: mathematics, science, reading comprehension, geography, and non-verbal ability (Purves, 1987). Carried out in 12 countries, and


focusing on students of 12 years of age, this study convinced scholars about the possibility of conducting ILSA studies whose results could have both theoretical and practical implications. Four years later, the First International Mathematics Study (FIMS) took place, focusing on a single subject matter and extending the student population to also involve students at their final year of secondary education (Husén, 1967; Postlethwaite, 1967). In 1970, this study was followed by a series of studies focusing on six different subject matters: science, reading comprehension, literature, foreign languages (English and French), and civic education, largely known as the Six Subject Survey. This first round of studies conducted in the mid-1960s and early 1970s pointed to different predictors of student outcomes, including opportunity to learn – based on how the curriculum is taught – student motivation, different methods of teaching, and school practices. It was not surprising then that new cycles of ILSA studies were conducted by the IEA in the next decades. Initiated by a study focusing again on mathematics (the Second International Mathematics Study, SIMS), the second cycle was conducted at the beginning of the 1980s, with an even broader country participation (20 countries); this study was again followed by a study focusing on Science (Second International Science Study) involving 24 countries. Because both these studies drew on items and questions from the two earlier corresponding studies, they did not only outline a picture of teaching mathematics and science in the participating countries, but they also enabled comparing and contrasting parts of this picture with how the two subjects were found to be taught; they also provided an opportunity for linking teaching with student performance in the first cycle. Even more critically, at this point, the designers of the IEA thought of conducting what came to be known as a Longitudinal Study (Purves, 1987). For a subset of the participating countries in SIMS, student performance data were obtained at two different time points. By obtaining both a pre-test and a post-test measure of achievement, this study enabled not only examining changes in student performance in mathematics but foremost gave the opportunity to investigate how different classroom and school characteristics contributed to this learning. This was also accomplished by another longitudinal IEA study, the Classroom Environment Study, which focused on the nature of the classroom activities and student-teacher interactions (Anderson, 1987). A similar longitudinal IEA study initiated in the 1980s, the Preprimary Project (cf. Olmsted & Weikart, 1995), examined how the early preprimary experiences of preprimary students of 4 years of age contributed to their cognitive and language performance at the age of 7. The decision to depart from this critical attribute that characterized the studies conducted in the 1980s, namely, their longitudinal character, limited the potential of these studies to yield information on student learning and its contributors. The 1990s marked the transition to ILSA studies that included even more countries and which focused on topics such as computers and civic education and engagement. The ILSAs focusing on mathematics and science conducted in the 1995, known as the Third International Mathematics and Science Study (TIMSS), turned out to be the first in a 4-year cycle of assessments in mathematics and science,


currently known as Trends in International Mathematics and Science Study. An important attribute of the ILSA studies conducted in the 1990s concerns the two videotaped studies that aimed to examine the teaching of mathematics in a subset of the participating countries. Accompanying the TIMSS 1995 study, the first TIMSS Videotaped Classroom Study examined the practices of teaching eighth grade mathematics in the USA, Japan, and Germany by analyzing 231 videotaped lessons. Recognizing that surveys alone cannot tell much about the teaching that takes place in the classroom and using a national probability sample, the designers of this videotaped study videotaped each of the participating classrooms for one complete lesson on a date convenient for the teacher (see Stigler & Hiebert, 1999). Parallel to that – although not using videotaped lessons – another study conducted over 120 classroom observations in mathematics and science classrooms in 6 countries, attempting to portray what comprises a “typical” lesson in the disciplines under consideration in those countries (Schmidt et al., 1996). The videotaping endeavor was also undertaken 4 years later in the context of TIMSS 1999, given that it was recognized that “to better understand, and ultimately improve, students’ learning, one must examine what happens in the classroom” (Hiebert et al., 2003, p. 2). The second TIMSS Videotaped Classroom Study extended its scope to include videotaped lessons from Australia, the Czech Republic, Hong Kong SAR, the Netherlands, Switzerland, the USA, and Japan. Despite the criticism fired at both these studies because of their sampling and the argument that the lessons videotaped were not necessarily representative of the typical teaching in each country, both videotaped studies afforded the research community an indispensable opportunity to not only peer into the classrooms and examine different approaches in teaching but foremost to start understanding how teaching practices can contribute to student outcomes (Creemers et al., 2010). The problem, however, was the absence of longitudinal data that characterized the IEA studies during the previous decade. Had TIMSS 1995 and 1999 also collected pre-test data, it would have been possible to examine how certain practices contribute to student learning. Despite this and other limitations, looking inside the classrooms – either through videotapes or through “live” classroom observations – marked a significant shift in the ILSAs carried out so far, since it emitted a significant message: that survey data alone may not tell much of the story of how teaching contributes to student learning (Caro et al., 2016). In addition to carrying out a TIMSS study on mathematics and science every 4 years, the IEA also conducted a series of studies in other subjects, such as Reading Literacy (i.e., the Progress in International Reading Literacy Study, PIRLS) and Information Technology and Computers (i.e., International Computer and Information Literacy Study, ICILS). What perhaps characterizes an important shift in the twenty-first century is the first international study to be conducted at tertiary education, the Teacher Education and Development Study in Mathematics (TEDS-M) (Döhrmann et al., 2012). This study investigated the policies, programs, and practices for the preparation of the future primary and lower-secondary mathematics teachers in 17 countries (Tatto et al., 2012). By initiating this study, the IEA


underlined the importance of exploring variables that may affect student achievement indirectly – such as teacher knowledge and preparation– and which can inform decisions on teacher education at the district or national level (Blömeke et al., 2015). Studies such as TEDS-M can be thought as a main avenue for collecting and channeling information to policy makers and other stakeholders about the effectiveness of tertiary education, just like the studies reviewed above have attempted to do for compulsory education. To summarize, the ILSAs conducted by the IEA covered a wide range of topics; adopted multiple data collection approaches, ranging from student tests, student/teacher/school self-reports to classroom observations; and pertained to different levels of education, ranging from pre-primary to tertiary education.

ILSAs Carried Out by the OECD Measuring Student Learning Outcomes We now move to the ILSA studies conducted by the OECD which had a somewhat different scope in terms of the type of outcomes measured and the data collection approaches pursued. Below we briefly focus only on one such OECD set of studies, the Programme for International Student Assessment (PISA) studies; we do so because of its focus on student achievement and its aim to identify potential predictors of student learning. Initiated in 2000, PISA studies are conducted every 3 years and mainly focus on three subject matters: mathematics, reading, and science. In contrast to the IEA studies, PISA studies measure skills and knowledge of students of 15 years of age, near the end of their compulsory education (OECD, 2017); therefore, these studies target the student population based on age, instead of grade-level. Additionally, unlike the IEA studies, PISA studies are literacy-oriented rather than curriculum-oriented; this implies that instead of examining mastery of specific school curricula, PISA studies investigate the extent to which students are able to apply knowledge and skills in the subject areas under consideration in a gamut of “authentic” situations, including analyzing, reasoning, communicating, interpreting, and solving problems (Lingard & Grek, 2008). So far, more than 90 countries have participated in PISA studies, which mainly use student tests and self-reports (of students, teachers, school headmasters, and parents) as the main data collection instruments. Regardless of their origin (IEA or OECD), their data collection methods, their student populations, and their underlying assumptions, the ILSA studies conducted so far have provided the research community with the means to start understanding not only student performance in one country relative to the other countries – which often turns out to be one of the misuses of such studies – but foremost to start grasping how different factors, be they related to the student, the classroom, the teacher, the school, the curriculum, or other wider contextual and system factors, contribute to student learning (e.g., Kyriakides & Charalambous, 2014; Caro et al., 2016). A similar agenda characterizes EER as it is explained below.


A Brief Historical Overview of the Educational Effectiveness Research EER can be seen as an overarching theme that links together a multinational body of research in different areas, including research on teacher behavior and its impacts, curriculum, student grouping procedures, school organization, and educational policy. The main research question underlying EER is the identification and investigation of which factors operating at different levels, such as the classroom, the school, and educational system, can directly or indirectly explain measured differences (variations) in the outcomes of students. Further, such research frequently takes into account the influence of other important background characteristics, such as student ability, socioeconomic status (SES), and prior attainment. Thus, EER attempts to establish and test theories which explain why and how some schools and teachers are more effective than others in promoting better outcomes for students (Scheerens, 2016). The origins of EER largely stem from reactions to seminal work on equality of opportunity in education that was conducted in the USA and undertaken by Coleman et al. (1966) and Jencks et al. (1972). These two studies coming from two different disciplinary backgrounds – sociology and psychology, respectively – drew very similar conclusions in relation to the amount of variance in student learning outcomes that can be explained by educational factors. Although these studies did not suggest schooling was unimportant, the differences in student outcomes that were attributable to attending one school rather than another were modest. However, these studies were criticized for failing to measure the educational variables that were of the most relevance (Madaus et al., 1979). Nevertheless, these two studies claimed that after taking into consideration the influence of student background characteristics, such as ability and family background (e.g., SES, gender, ethnicity), only a small proportion of the variation in student achievement could be attributed to the school or educational factors. This pessimistic feeling of not knowing what, if anything, education could contribute to reducing inequality in educational outcomes and in society as a whole was also fed by the apparent failure of large-scale educational compensatory programs, such as “Headstart” and “Follow Through,” conducted in the USA, which were based on the idea that education in preschool/ schools would help to compensate for the initial differences between students. As a consequence, the first two school effectiveness studies that were independently undertaken by Brookover, Beady, Flood, Schweitzer, and Wisenbaker (1979) in the USA and Rutter, Maughan, Mortimore, Ouston, and Smith (1979) in England during the 1970s were concerned with examining evidence and arguing in support of the potential power of schooling to make a difference to the life chances of students. One may therefore consider these two projects as the first attempts to show the contribution that teachers and schools may make to reduce unjustifiable differences in student learning outcomes. By providing encouraging results regarding the effect of teachers and schools on student outcomes, these two studies have paved the way for a series of studies that followed in different countries and with different student populations, all aiming to further unpack and understand how teachers and schools


contribute to student learning. The establishment of the International Congress for School Effectiveness and School Improvement, along with its related journal School Effectiveness and School Improvement funded in 1990, formally heralded the development of a new field concerned with understanding how the classroom and school processes influence student learning. A plethora of studies have been conducted in the field of EER, which can be grouped in four phases. Mainly concerned with proving that teachers and schools do matter for student learning, the studies conducted during the first phase set the foundations for this new field. Conducted in the early 1980s, the studies of this phase attempted to show that there were differences in the impact of particular teachers and schools on student learning outcomes. Once empirical evidence regarding the effect of education on student learning has begun to amount, scholars in EER further refined their research agenda, trying to understand the magnitude of the school effects to student learning. By the end of this phase, mounting empirical evidence was accrued showing that schools and teachers do matter for students learning. After having established the role of teachers and schools in student learning, the next step was to understand what contributes to this learning. Therefore, the studies of the second phase of EER largely aimed at identifying factors that can help explain differences in the educational effectiveness. As a result of the studies undertaken during this phase, lists of correlates associated with student achievement were generated, often leading to models of educational effectiveness. The models proposed during this phase emphasized the importance of developing more sound theoretical foundations for EER, an endeavor undertaken during the next phase (Scheerens, 2013). During the third phase of EER, three perspectives within EER have been developed which attempted to explain why and how certain characteristics contribute to the promotion of student learning outcomes (i.e., the quality dimension of effectiveness): (1) the economic perspective, (2) the psychological perspective, and (3) the sociological perspective. These perspectives are associated with specific theories. For example, the economic perspective takes into account theories within the field of economics education such as the human capital theory (Becker, 1975). Similarly, the sociological perspective takes into account sociological theories such as the social reproduction theory (Collins, 2009). In this section, we describe each perspective and its implications for modeling effectiveness. Firstly, in order to explain variation in the effectiveness of teachers and schools, economists focused on variables concerned with resource inputs, such as the per student expenditure. Specifically, the economic approach focused on producing a mathematical function which revealed the relationship between the “supply of selected purchased schooling inputs and educational outcomes controlling for the influence of various background features” (Monk, 1992, p. 308). Thus, the associated emergence of “education production” models (e.g., Brown & Saks, 1986; Elberts & Stone, 1988) were based on the assumption that increased inputs will lead to increments in outcomes. 
The second model to emerge from this phase of EER featured a sociological perspective and focused on factors that define the educational and family background of students, such as SES, ethnic group, gender, social-capital, and peer group. This perspective examined not only student outcomes but also the extent to


which schools manage to ameliorate or increase the variation in student outcomes when compared to prior achievement. As a consequence, this perspective of EER drew attention to the importance of using two dimensions to measure school effectiveness: these were concerned not only with improving the quality of schools (i.e., supporting students to achieve good outcomes) but also with enhancing equity in schools (i.e., reducing the achievement gaps between advantaged and disadvantaged groups). Attention was also given to identify factors promoting quality which emerged from organizational theories (including climate, culture, and structure) (see Reynolds et al., 2014; Scheerens, 2016) as well as with contexts/characteristics such as the concentration of disadvantaged students and the impacts of this on student outcomes and school and classroom processes (Opdenakker & van Damme, 2006). Finally, educational psychologists in this period focused on student background factors such as “learning aptitude” and “motivation” and on variables measuring the learning processes which take place in classrooms. Further, there was an interest in identifying and understanding the features of effective instructional practice, and this led to the development of a list of teacher behaviors that were positively and consistently related to student achievement over time (Brophy & Good, 1986). In essence, the work associated with this last perspective contributed to a re-orientation of EER – both theoretically and empirically – on the processes transpiring at the teaching and learning level, thus considering factors at the classroom or the teaching level as the primary contributors to student learning. More recently, there also seems to be a shift from focusing only on observable teaching behaviors to additionally considering aspects of teacher cognition and thinking, as potential contributors to both teaching quality and student learning (e.g., Hill, Ball, & Schilling, 2008; Shechtman, Roschelle, Haertel, & Knudsen, 2010; Hamre et al., 2012; Charalambous, 2016). Having shown that schools matter and having generated models that explain teacher and school effectiveness, during the fourth and most recent phase of EER, scholars are largely preoccupied with issues of complexity. A gradual move from the third to the fourth phase is particularly observed after 2000, when researchers started to realize that educational effectiveness should not be seen as a stable characteristic and should rather be considered as a dynamic attribute of teachers and schools, and one that might vary across years, different student populations, different outcomes, and even different subject matters (see, e.g., Charalambous et al., 2019; Creemers & Kyriakides, 2008). Consequently, EER scholars have started to attend to issues such as growth and change over time – which has become the major focus of this phase – as well as issues such as consistency, stability, and differential effectiveness (Scheerens, 2016). From this respect, it is not a coincidence that during this phase, school effectiveness is coming even closer to school improvement, aiming to propose and empirically test how different theoretical models of educational effectiveness can contribute to the improvement of the functioning of schools (Kyriakides et al., 2018). Because of this emphasis on change, theoretical developments in EER during this phase have also been associated with methodological developments which have backboned this new research agenda of EER. 
In fact, a closer attention to the


evolution of EER suggests that the theoretical and empirical developments of EER studies have been accompanied and supported by methodological advancements, as summarized below. During the first phase, major emphasis was given on outlier studies that compared the characteristics of more and less effective schools. Because of conceptual and methodological concerns associated with these studies, during the second and third phase, researchers moved to cohort designs and more recently to longitudinal designs involving large numbers of schools and students. Additionally, during the last two phases, emphasis is given to searching predictors that have indirect effects, in addition to the direct effects examined in the previous two phases (see Creemers et al., 2010). The employment of advanced techniques, such as multilevel modeling to account for the nested nature of educational data and the development of contextual value-added models that controlled for student prior attainment background characteristics, as well as contextual measures of class or school composition (Harker & Tymms, 2004; Opdenakker & van Damme, 2006), also contributed significantly to the development of EER studies. The recent development of multilevel structural equation modeling (SEM) approaches (see Hox & Roberts, 2011) is also envisioned to advance work in EER, by enabling researchers to search for indirect effects and examine the validity of recent EER models, which are multilevel in nature (Kyriakides et al., 2015). Finally, during this last phase of EER, emphasis is paid on longitudinal models with at least 3 years of measurement, attempting to examine how changes in the functioning of the effectiveness factors under consideration are associated with changes in educational effectiveness. By employing such longitudinal approaches, scholars can also investigate reciprocal relationships between different factors, which are advanced by more recent EER theoretical developments.
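As a hedged illustration of the multilevel logic described above (a generic textbook specification, not a model estimated in the studies reviewed here), a basic two-level random-intercept model with a student-level covariate and a school-level predictor can be written as:

```latex
y_{ij} = \gamma_{00} + \gamma_{10}\,x_{ij} + \gamma_{01}\,w_{j} + u_{0j} + e_{ij},
\qquad u_{0j} \sim N(0, \tau^{2}), \quad e_{ij} \sim N(0, \sigma^{2})
```

where i indexes students and j schools or classes, x_{ij} is a student-level covariate such as prior attainment or SES, w_{j} is a class- or school-level predictor such as a compositional measure, and the intraclass correlation \tau^{2}/(\tau^{2} + \sigma^{2}) indicates how much of the outcome variance lies between schools. Contextual value-added and multilevel SEM approaches extend this basic specification with additional predictors, levels, and indirect paths.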

Connections Between ILSA and EER

Although they have had a similar agenda in the past 60 years – to understand what contributes to student achievement/learning – ILSA and EER seem to have evolved as two distinct and rather unrelated domains; despite this, each area has contributed to the other in significant ways. In the first part of this section, we refer to the contribution of ILSA to EER and then identify the ways in which EER has informed ILSA.

The first way in which ILSA contributed to EER pertains to highlighting the importance of, and mobilizing the resources for, conducting educational effectiveness studies. In particular, the ultimate goal of ILSA seems to have been to raise awareness of the importance of education and its effects. Admittedly, because of media pressure, the results of ILSA have often been misinterpreted and misused in simplistic ways, either to rank order countries based on student performance or even to transplant ideas from systems that "proved to be working" to less effective systems, without any detailed acknowledgment of the possible context specificity of the apparently "effective" policies in the original societies utilizing them. For example, Reynolds (2006) attributed the British enthusiasm for whole class direct instruction at Key Stage 2 in British primary schools to a simplistic association between the educational practices of the Pacific Rim and their high levels of achievement in international studies. Without underestimating these side effects arising from ILSA, information yielded from such studies has been employed more constructively to inform policy makers, curriculum specialists, and researchers by functioning as a mirror through which each participating country can start better grasping its educational system (Schmidt & Valverde, 1995). This opportunity to look closer at the educational systems of different countries has motivated policy makers to fund research projects aiming to understand the nature of educational effectiveness and identify ways of improving school effectiveness. Consequently, a number of EER projects were initiated in various countries (e.g., Baumert et al., 2010; Panayiotou et al., 2014).

Another way in which EER has benefited from ILSA pertains to the national, and in some respects limiting, character of most EER studies (Muijs et al., 2014). Because of their international character, the data emerging from ILSA have much larger variance in the functioning of possible predictors of student outcomes, given that dissimilarities are more likely to occur across rather than within countries. This large variation, in turn, increases statistical power and gives scholars working in EER and capitalizing on data from ILSA the opportunity to examine whether variables at the teacher and school level predict student outcomes. By affording researchers in the field of EER rich international datasets, ILSA also enabled them to conduct secondary analyses (e.g., Maslowski et al., 2007; Caro et al., 2016; Charalambous & Kyriakides, 2017; Caro et al., 2018) which can facilitate the theoretical and methodological development of the field. Additionally, researchers can conduct both across- and within-country analyses and search for the extent to which specific factors at the classroom and/or the school level can be treated as generic while others can be seen as country-specific (Gustafsson, 2013). By also including information on several contextual factors, ILSA provided a platform for examining why some factors operate mainly in specific contexts while others "travel" across countries (Creemers et al., 2010).

Turning to the contribution of EER to ILSA, perhaps the main way in which the former has been supportive of the latter pertains to providing theoretical constructs to inform the design of ILSA. This influence is more palpable in the most recent PISA studies (e.g., PISA 2012, PISA 2015, and PISA 2018), which have capitalized on theoretical models from EER to develop both the theoretical frameworks undergirding those studies and the associated measurement tools (see OECD, 2016). In fact, in more recent ILSA, emphasis is placed on process variables at the classroom and school level which are drawn from effectiveness factors included in EER models. For example, instead of examining the school climate factor at a more general level, recent PISA studies concentrate on specific school learning environment factors shown to be associated with student achievement, as suggested by meta-analyses conducted in the field of EER (Kyriakides et al., 2010; Scheerens, 2013). Also, instead of examining teaching behaviors more generally or the content taught, recently ILSA have started investigating the impact of specific teaching behaviors, be they generic or content specific.
This broadening of their scope also aligns with recent meta-analyses of EER pointing to the importance of exploring both types of practices as potential contributors to student learning (e.g., Seidel & Shavelson, 2007; Kyriakides et al., 2013; Scheerens, 2016).

The second contribution of EER pertains to its impact on the methodological design of ILSA and the analysis of the data yielded from such studies. Multilevel analysis has been a prominent approach in EER, especially during the last three decades, because it takes into consideration the nested character of educational data. The systematic use of multilevel analysis in EER has also been fueled by the type of research questions addressed in this field. At the beginning of this century, the importance of employing multilevel techniques to analyze ILSA data was stressed (e.g., Kyriakides & Charalambous, 2014), and such approaches have made their way into both the technical reports of various ILSA studies (e.g., OECD, 2009, 2016; Schulz et al., 2018) and secondary analyses of ILSA data (e.g., Gustafsson, 2013; Caro et al., 2016). EER scholars have also noted the limitations of using cohort data from ILSA to identify causal relationships between effectiveness factors and student learning outcomes (Gustafsson, 2013; Caro et al., 2018). Recognizing these limitations, these scholars have pointed to the importance of identifying trends in the functioning of different factors across years. Identifying such trends will enable both the design of policy reforms to account for any undesirable changes in teaching quality and/or student learning that occur over time and the evaluation of interventions designed to improve teaching and learning. In this respect, the importance of maintaining focus on certain theoretical constructs and using identical items/scales across different cycles of ILSA studies such as PISA and TIMSS is stressed.

The examination of the contribution of ILSA to EER and vice versa as outlined above is by no means comprehensive; rather, it indicates how in the past 50 years each domain has informed the design and has contributed to the evolution of the other domain. Admitting that this mutual interaction between these domains has been beneficial for both and assuming that it will continue in the future, in what follows we argue that from a theoretical perspective, ILSAs can be informed by the gradual shift currently witnessed in EER from a static to a more dynamic conceptualization of the process of teaching/schooling/learning. Specifically, we claim that the dynamic model of educational effectiveness could be a foundation upon which ILSAs could be based, both in terms of their design and in terms of secondary data analysis which can help better understand what contributes to student learning and through that develop reform policies to promote quality and equity in education. In this context, the next section provides a brief overview of the dynamic model, whereas in the last section implications for ILSA are provided.
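Before moving on, the omitted prior achievement problem noted above (Caro et al., 2018) can be illustrated with a small simulation, sketched below in Python with numpy and statsmodels. The setup is entirely artificial and the coefficients are arbitrary: a predictor is assumed to have a genuinely positive effect on achievement while also responding negatively to unobserved prior achievement, as would happen with remedial practices.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000

# Unobserved prior achievement (not available in a cross-sectional ILSA).
prior = rng.normal(size=n)

# A "remedial" predictor whose level rises when prior achievement is low,
# e.g., extra support or involvement triggered by weak performance.
support = -0.5 * prior + rng.normal(size=n)

# True data-generating process: support genuinely helps (+0.3), and prior
# achievement carries over into current achievement (+0.8).
achievement = 0.8 * prior + 0.3 * support + rng.normal(size=n)

df = pd.DataFrame({"achievement": achievement, "support": support, "prior": prior})

# Cross-sectional model (prior achievement omitted): the estimated effect of
# support is biased downward and may even appear negative.
naive = smf.ols("achievement ~ support", data=df).fit()

# Model controlling for prior achievement: recovers a value close to +0.3.
adjusted = smf.ols("achievement ~ support + prior", data=df).fit()

print(naive.params["support"], adjusted.params["support"])
```

In this toy example the naive cross-sectional estimate is close to zero or negative even though the true effect is positive, which is the pattern of counterintuitive associations that the omitted prior achievement argument is meant to explain.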

The Dynamic Model of Educational Effectiveness: An Overview

The dynamic model of educational effectiveness represents the outcome of a systematic attempt to develop a framework of effectiveness that is able to consider the dynamic nature of education and that is comprehensive enough to be used by stakeholders in education for improving the outcomes of educational efforts (Sammons, 2009). It takes into account the new goals of education which more broadly define the expected outcomes of schooling and are not restricted solely to the achievement of basic skills. Taking into account the need for promoting not only students’ cognitive skills, but also metacognition, as well as affective and psychomotor skills, the dynamic model aligns with the need for viewing education in a more holistic manner and constitutes a step toward evolving previous theories of educational effectiveness. This suggests that the models of EER should take into account the new goals of education and relate them to their implications for teaching and learning (van der Werf et al., 2008). This means that the outcome measures should be defined in a broader way rather than being restricted to the achievement of basic skills. It also implies that new theories of teaching and learning should be used in order to specify variables associated with quality of teaching. It is important to note here that this assumption has been considered in testing the validity of the dynamic model, since the impact of factors on promoting different types of student learning outcomes (cognitive, affective, psychomotor, and meta-cognitive) was investigated (Kyriakides et al., 2020). The dynamic model is not only parsimonious in using a single framework to measure the functioning of factors (see Sammons, 2009; Bates, 2010) but also aims to describe the complex nature of educational effectiveness (see Scheerens, 2013; Reynolds et al., 2014). This implies that the model is based on specific theories, but at the same time some of the factors included in the major constructs of the model are expected to be interrelated within and/or between levels.

Specifically, the dynamic model is multilevel in nature and refers to factors operating at four different levels (see Fig. 1): student, classroom/teacher, school, and system. Student-level factors are classified into three categories: (a) sociocultural and economic background variables that emerged from the sociological perspective of EER, such as SES, gender, and ethnicity, (b) background variables that emerged from the psychological perspective of EER, such as motivation and thinking style, and (c) variables related to specific learning tasks associated with the learning outcomes used to measure effectiveness, such as prior achievement, time on task, and opportunity to learn (see Creemers & Kyriakides, 2008). It is also acknowledged that variables of the second category, such as subject motivation and thinking styles, are likely to change over time, and student achievement is expected to have a reciprocal relationship with the factors of this category. For example, students with higher scores in subject motivation are expected to achieve better results at the end of the school year, but at the same time those who have higher achievement scores are likely to develop more positive attitudes toward the subject (Bandura, 1989; Marsh, 2008). In this context, this category consists of factors that can be improved through teacher and school initiatives and are therefore treated not only as factors of effectiveness but also as learning outcomes of effective schools. On the other hand, the first category refers to background factors that are not only associated with the sociological perspective of EER but are also not likely to change as a result of teacher and school initiatives.
For example, SES is associated with student learning outcomes, but neither schools nor individual students have the power to change this factor. Nevertheless, teachers and schools need to consider the background characteristics of individual students in order to respond to their needs and in this way contribute to the promotion of both quality and equity (Kyriakides et al., 2018). Moreover, it is argued that teachers and schools should aim to reduce the impact of these factors on student achievement in order to promote equity. Finally, the importance of searching for relationships among factors operating at the student level is stressed. In the case of student factors that remain stable over time, it is also considered important to investigate interaction effects, especially since research on equal opportunities has revealed that there are significant interactions between SES, gender, and ethnicity when it comes to studying differences in the educational outcomes of different groups of students (Kyriakides et al., 2019).

Fig. 1 The dynamic model of educational effectiveness. (The figure depicts the four levels of the model: national/regional policy for education, evaluation of policy, and the wider educational environment at the system level; school policy and evaluation of school policy at the school level; quality of teaching at the classroom level, comprising orientation, structuring, modelling, application, questioning, assessment, management of time, and the classroom as a learning environment; and, at the student level, aptitude, expectations, perseverance, thinking style, subject motivation, time on task, opportunity to learn, SES, gender, and ethnicity. Outcomes are classified as cognitive, affective, psychomotor, and new learning.)

At the classroom level, the model refers to the following eight factors which describe teachers’ instructional role: orientation, structuring, questioning, teaching-modeling, application, management of time, the teacher’s role in making the classroom a learning environment, and classroom assessment. The model refers to skills associated with direct teaching and mastery learning (Joyce et al., 2000), such as structuring and questioning. Factors included in the model such as orientation and teaching-modeling are in line with theories of teaching associated with constructivism (Brekelmans et al., 2000). Thus, an integrated approach to quality of teaching is adopted. Since learning takes place primarily at the classroom level, factors situated at the school and system level are expected to influence primarily the teaching practice and, through that, student learning outcomes. School-level factors are, therefore, expected to influence the teaching-learning situation by developing and evaluating the school policy on teaching and the policy on creating the school learning environment. The system level refers to the influence of the educational system in a more formal way, especially by developing and evaluating the educational policy at the national/regional level. The teaching and learning situation is also influenced by the wider educational context in which students, teachers, and schools operate.

One essential difference of the dynamic model from all the other theoretical models of EER has to do with its attempt to propose a specific framework for measuring the functioning of teacher/school/system level factors. The model is based on the assumption that each factor can be defined and measured using five dimensions: frequency, focus, stage, quality, and differentiation. First, the frequency dimension refers to the number of times that an activity associated with a factor is present in a system, school, or classroom. This is probably the easiest way to measure the effect of a factor on student achievement, and almost all studies used this dimension to define effectiveness factors (Creemers & Kyriakides, 2008). The other four dimensions are concerned with qualitative characteristics of the factors included in the dynamic model. Specifically, two aspects of the focus dimension are taken into account. The first one refers to the specificity of the activities, which can range from specific to general. The second aspect of this dimension addresses the purpose for which an activity takes place. An activity may be expected to achieve a single purpose or multiple purposes. Third, the activities associated with a factor can be measured by taking into account the stage at which they take place. Factors need to take place over a long period of time to ensure that they have a continuous direct or indirect effect on student learning (Levin, 2005). Fourth, the quality dimension refers to the properties of the specific factor itself as discussed in the literature. For instance, teacher assessment can be measured by looking at the extent to which the formative rather than the summative purpose is served (Marshall & Drummond, 2006; Black & Wiliam, 2009).
Finally, differentiation refers to the extent to which activities associated with a factor are implemented in the same way for all subjects involved with it. The use of different measurement dimensions reveals that looking at just the frequency dimension of an effectiveness factor does not help us identify those aspects of the functioning of a factor which are associated with student achievement. Thus, the five dimensions are not only important from a measurement perspective but also, and to a greater degree, from a theoretical point of view (Kyriakides et al., 2018).

Finally, the dynamic model not only refers to factors associated with student achievement gains but also assumes that there is a close relation between the two dimensions of effectiveness (quality and equity). However, this assumption has never been made explicit, and the relation between the quality and the equity dimensions of effectiveness has not been examined. During the last 15 years, several longitudinal studies conducted in different countries (e.g., Creemers & Kyriakides, 2010; Kyriakides et al., 2010; Azkiyah et al., 2014; Christoforidou & Xirafidou, 2014; Panayiotou et al., 2014; Azigwe et al., 2016; Paget, 2018; Lelei, 2019; Musthafa, 2020) and two meta-analyses of effectiveness studies (Kyriakides et al., 2010; Kyriakides et al., 2013) provided support for the validity of the dynamic model by demonstrating relations between the teacher and school factors included in the model and student achievement. This implies that factors of the dynamic model may be taken into account for promoting quality in education but does not necessarily mean that these factors are relevant for promoting equity. By introducing two different dimensions of measuring effectiveness, a question that arises is the extent to which teachers/schools/systems can be effective in terms of both quality and equity (Kyriakides et al., 2018). Most effectiveness studies, while examining the magnitude of teacher and school effects, have paid very little attention to the extent to which teachers and schools perform consistently across different school groupings (Campbell et al., 2004). During the last two decades, there has been an emphasis on investigating differential teacher and school effectiveness. In this chapter, it is argued that analysis of ILSA studies may help us search for differential teacher and school effectiveness. Such secondary analysis of ILSA data may provide a new perspective in the discussion about the dimensions of educational effectiveness and may reveal teacher/school/system factors that are associated not only with the quality but also with the equity dimension of effectiveness.
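As an illustration of how such a secondary analysis might distinguish the two dimensions, the sketch below (in Python with statsmodels, using hypothetical variable names and again ignoring ILSA design features such as plausible values and sampling weights) regresses achievement on a school factor, student SES, and their interaction in a two-level model. The main effect speaks to the quality dimension, whereas the interaction term speaks to the equity dimension: a negative interaction would suggest that the factor weakens the association between SES and achievement, that is, narrows the gap between advantaged and disadvantaged students.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level dataset with a school factor score (e.g., a
# school learning environment scale), student SES, achievement, and school_id.
df = pd.read_csv("ilsa_students.csv")

# Two-level model with a factor-by-SES interaction and a random intercept for
# schools. The factor's main effect relates to the quality dimension; the
# sign and size of the interaction relate to the equity dimension.
model = smf.mixedlm(
    "achievement ~ learning_environment * ses",
    data=df,
    groups="school_id",
)
result = model.fit()
print(result.params[["learning_environment", "ses", "learning_environment:ses"]])
```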

Advancements of ILSAs by Making Use of the Dynamic Model

As explained in the previous section, during the fourth phase of EER, more emphasis is placed on the dynamic nature of educational effectiveness. Instead of merely focusing on what happens at a specific point in time (i.e., a static approach), it is recommended that attention be paid to the actions taken by teachers, schools, and other stakeholders to deal with the challenges that seem to impinge on learning. For example, some items of recent PISA cycles ask headmasters to indicate whether student learning is hindered by student or teacher absenteeism. Apart from collecting such static data, we argue that at least equal emphasis needs to be given to the actions that the educational systems and their constituent components (schools, teachers, parents) take in order to deal with this challenge (e.g., through reducing teacher/student absenteeism, encouraging parental involvement, etc.). A similar argument applies to items in IEA studies: instead of simply tapping into students’ and teachers’ expectations, actions taken to raise expectations should also be investigated. We therefore argue that ILSA studies should develop their theoretical frameworks by considering the dynamic nature of effective schooling and teaching and, based on that, design instruments which measure actions taken by school policy makers and teachers to improve the school and classroom learning environment.

A second possible way of advancing ILSA by drawing on lessons learned in the context of EER pertains to becoming more critical about the sources of data used in measuring different constructs. For example, information on school-level factors (e.g., the school learning environment) is typically gathered through questionnaires administered to the school headmaster. However, it has been found that student, teacher, and headmaster conceptions of school-level factors are independent of each other, whereas within-group perceptions (e.g., student or teacher perceptions alone) turn out to be quite consistent. These findings pose a challenging question: Can the data drawn from a questionnaire administered to a single person (i.e., the headmaster) yield accurate and valid information regarding school-level factors? Because the answer is rather in the negative, we argue that information on school-level factors should also be collected at least from the teaching staff, an approach that will allow not only testing the generalizability of the data but also revealing any potential biases. For example, in schools with a poor learning environment, teachers might be more inclined to portray the real situation than the headmaster, who may feel more responsible for the poor functioning of the school.

A similar challenge pertains to the data collected on teaching, where most information is typically collected through surveys. Student ratings of teacher performance are frequently used in EER, although not without criticism. As direct recipients of the teaching-learning process, students are in a key position to provide information about teacher behavior in the classroom (Kyriakides et al., 2014). Moreover, student ratings constitute a main source of information regarding the development of motivation in the classroom, opportunities for learning, the degree of rapport and communication developed between teacher and student, and classroom equity. Although more economically efficient, it should be acknowledged that surveys are typically less accurate in depicting what goes on in the classroom than classroom observations or log books (Pianta & Hamre, 2009; Rowan & Correnti, 2009). In earlier cycles of IEA studies, classroom observations were used in parallel with other data collection approaches, something that suggests that similar approaches can also be employed in the future. The financial and other logistic difficulties inherent in conducting observations (e.g., training observers and monitoring their ability to consistently and adequately use the observation instruments in each country) cannot be underestimated. However, past experience has pointed to the multiple benefits of employing classroom observations, at least as complements to surveys. By employing classroom observations, more aspects related to effective teaching practices could be measured, covering both generic and domain-specific teaching skills found to be associated with student achievement gains.
This dual focus on teaching skills has actually been manifested in more recent rounds of ILSA (e.g., in the most recent PISA self-report surveys) and hence seems to resonate with the expansion of the ILSA agenda to also incorporate such practices, especially since studies testing the validity of current theoretical models of EER (including the dynamic model) reveal that teacher behavior in the classroom can explain a large proportion of the teacher effect on student learning outcomes (Muijs et al., 2014; Scheerens, 2016; Charalambous & Praetorius, 2018).

Drawing on EER studies, ILSA could also collect prior achievement data, which will enable examining gains in student learning and consequently connecting these gains to student, teacher, and school-level factors. One of the major methodological limitations of ILSA is related to the cross-sectional design. The lack of longitudinal data prevents researchers from interpreting observed associations with student achievement as effects. However, there are some successful attempts by different countries to collect follow-up data in the framework of ILSA studies, such as the study conducted by Baumert and his colleagues (Baumert et al., 2010) in Germany. Because prior achievement cannot be controlled for, it is possible that observed associations are confounded by prior achievement measures not available in the cross-sectional design. For instance, studies with ILSA datasets report weak or negative associations between student achievement and parental involvement in school (Caro, 2011) and teachers’ student-oriented practices (Caro et al., 2016). These results are against expectations and might actually reflect omitted prior achievement bias (Caro et al., 2018). That is, the associations might not only capture the positive effect of parental involvement and student-oriented practices on student achievement but also a negative correlation between these variables and unobserved prior achievement. Such a negative correlation may arise as a result of remedial practices, for example, if teachers and parents react to low performance of students by assigning more homework, getting more involved, and individualizing teaching. Due to the lack of longitudinal achievement data, it is not possible to disentangle these two different effects. As a consequence, we argue here for the importance of collecting data on prior achievement, especially since aptitude is treated as one of the most significant student-level factors not only by the dynamic model but also by earlier theoretical models of EER (including Carroll’s model of learning). Collecting such data will not be new to ILSA, since, as the reader might recall, such data have also been collected in earlier cycles of ILSA studies. The benefits of collecting prior achievement data in future ILSA are expected to be multidimensional, not only in methodological terms but also in terms of informing policy decisions. For instance, if information on student progress rather than student final achievement is reported, the press might not continue adhering to simplistic approaches which pit one country against the other and ignore differences in student entrance competencies. This, in turn, might encourage underdeveloped countries to also participate in ILSA, for the focus will no longer be on how each country performs relative to the others but rather on the progress (i.e., student learning) made within each country.

Recent developments in ILSA can also advance the work currently undertaken in the field of EER. Specifically, if the recommendations provided above are considered in future ILSA, richer datasets might be yielded.
This, in turn, can benefit EER in significant ways, since secondary analyses employing such datasets could contribute to the testing and further development of theoretical frameworks of EER. One main element in these frameworks concerns the system-level factors, such as the national policy on teaching or the evaluation of such policies (cf. Creemers & Kyriakides, 2008). Testing these factors is an area in which ILSA could be particularly useful, given that EER studies are mostly national and hence do not lend themselves to testing such factors. To successfully explore the effect of such factors, ILSA that collect data on the functioning of the educational system (e.g., the OECD Teaching and Learning International Survey [TALIS] and the INES Network for the Collection and Adjudication of System-Level Descriptive Information on Educational Structures, Policies and Practices [NESLI]) should expand their agenda to also investigate the national policies for teaching and the school learning environment. These data also need to be linked with student achievement data. This will enable investigating whether these system-level factors have direct and indirect effects – through school- and teacher-level factors – on student learning outcomes (Kyriakides et al., 2018). For example, links could be established between studies such as TIMSS, NESLI, and TALIS. More specifically, in TALIS 2013, participating countries and economies had the option of applying TALIS questionnaires to a PISA 2012 subsample with the purpose of linking data on schools, teachers, and students. This database was called the "TALIS-PISA Link." The TALIS-PISA Link provided important information about teaching strategies and their relationship with the characteristics of the school, the classroom, and students’ outcomes. A better understanding of these relationships can help teachers, schools, and policy makers to design more effective policies with the aim of improving students’ learning outcomes (Le Donné et al., 2016). Another example is the IEA Rosetta Stone project, which is based on a strategy to measure global progress toward the UN Sustainable Development Goal (SDG) for quality in education by linking regional assessment results to the TIMSS and PIRLS International Benchmarks of Achievement. The main goal is to develop a concordance table that translates scores from regional mathematics and reading assessments onto the TIMSS and PIRLS scales (see https://www.iea.nl/studies/additionalstudies/rosetta).

Finally, we note that for many years, emphasis has been given to investigating cognitive outcomes, in both ILSA and EER. The dynamic model argues for the importance of considering new learning goals (such as self-regulation and metacognition) and their implications for teaching. Because of this recent reconceptualization of the mission of compulsory schooling to also incorporate new learning goals, PISA has lately included measures of these types of learning outcomes in addition to the traditional cognitive outcomes, such as the measurement of collaborative problem solving skills in PISA 2015 (OECD, 2017). Given that the instruments developed in the context of PISA studies have proven to have satisfactory psychometric properties, future EER studies can capitalize on these instruments to examine whether the impact of different effectiveness factors is consistent across different types of learning outcomes. For example, it is an open issue whether factors typically associated with the Direct and Active teaching approach, such as structuring and application, relate only to cognitive outcomes, while constructivist-oriented factors such as orientation and modeling are associated with both types of outcomes.
Secondary analyses of ILSA may therefore contribute to identifying generic factors which are associated with different learning outcomes in different contexts, as well as factors which are more relevant for specific types of learning outcomes and/or in specific contexts. In this way, the agenda of EER could be expanded from searching for what works in education and why to finding out under which conditions and for whom these factors can promote different types of student learning outcomes (Kyriakides et al., 2020).

In this chapter we have suggested that the two fields – ILSA and EER – not only have similar agendas but also have commonalities in several respects, ranging from design, to analysis, to how their results can inform policy. In the last part of the chapter, we have argued that from a theoretical perspective, ILSA can be informed by the gradual shift currently witnessed in EER from a static to a more dynamic conceptualization of the process of schooling. Specifically, it has been argued that the dynamic model of educational effectiveness could be a foundation upon which ILSAs are based, both in terms of their design and in terms of secondary data analysis which can help better understand what contributes to student learning. Because both fields give emphasis to providing evidence-based suggestions for improving policy, we believe that in the years to come a closer collaboration between scholars in both fields can advance both domains and better serve their common agenda: to understand what contributes to student learning and through that develop reform policies to promote quality and equity in education.

References Anderson, L. W. (1987). The classroom environment study: Teaching for learning. Comparative Education Review, 31(1), 69–87. Azigwe, J. B., Kyriakides, L., Panayiotou, A., & Creemers, B. P. M. (2016). The impact of effective teaching characteristics in promoting student achievement in Ghana. International Journal of Educational Development, 51, 51–61. Azkiyah, S. N., Doolaard, S., Creemers, B. P. M., & Van Der Werf, M. P. C. (2014). The effects of two intervention programs on teaching quality and student achievement. Journal of Classroom Interaction, 49(1), 4–11. Bandura, A. (1989). Regulation of cognitive processes through perceived self-efficacy. Developmental Psychology, 25(5), 729–735. https://doi.org/10.1037/0012-1649.25.5.729 Bates, R. (2010). The dynamics of educational effectiveness. British Educational Research Journal, 36(1), 166–167. Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., . . . Tsai, Y. M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47(1), 133–180. Becker, G. (1975). Human capital: A theoretical and empirical analysis (2nd ed.). Columbia University Press. Black, P., & Wiliam, D. (2009). Developing a theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5–31. Blömeke, S., Gustafsson, J. E., & Shavelson, R. (2015). Beyond dichotomies competence viewed as a continuum. Zeitschrift für Psychologie, 223(1), 3–13. Brekelmans, M., Sleegers, P., & Fraser, B. (2000). Teaching for active learning. In P. R. J. Simons, J. L. van der Linden, & T. Duffy (Eds.), New learning (pp. 227–242). Kluwer Academic Publishers.


Brookover, W. B., Beady, C., Flood, P., Schweitzer, J., & Wisenbaker, J. (1979). School systems and student achievement: Schools make a difference. Praeger. Brophy, J., & Good, T. L. (1986). Teacher behaviour and student achievement. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 328–375). Macmillan. Brown, B. W., & Saks, D. H. (1986). Measuring the effects of instructional time on student learning: Evidence from the beginning teacher evaluation study. American Journal of Education, 94(4), 480–500. https://doi.org/10.1086/443863 Campbell, R. J., Kyriakides, L., Muijs, R. D., & Robinson, W. (2004). Assessing teacher effectiveness: A differentiated model. RoutledgeFalmer. Caro, D. H. (2011). Parent-child communication and academic performance: Associations at the within- and between-country level. Journal for Educational Research Online, 3, 15–37. Caro, D. H., Kyriakides, L., & Televantou, I. (2018). Addressing omitted prior achievement bias in international assessments: An applied example using PIRLS-NPD matched data. Assessment in Education: Principles, Policy & Practice, 25(1), 5–27. Caro, D. H., Lenkeit, J., & Kyriakides, L. (2016). Teaching strategies and differential effectiveness across learning contexts: Evidence from PISA 2012. Studies in Educational Evaluation, 49, 30–41. Charalambous, C. Y. (2016). Investigating the knowledge needed for teaching mathematics: An exploratory validation study focusing on teaching practices. Journal of Teacher Education, 67(3), 220–237. Charalambous, C. Y., & Kyriakides, E. (2017). Working at the nexus of generic and content-specific teaching practices: An exploratory study based on TIMSS secondary analyses. The Elementary School Journal, 117(3), 423–454. Charalambous, C. Y., Kyriakides, E., Kyriakides, L., & Tsangaridou, N. (2019). Are teachers consistently effective across subject matters? Revisiting the issue of differential teacher effectiveness. School Effectiveness and School Improvement, 30(4), 353–379. https://doi.org/10. 1080/09243453.2019.1618877 Charalambous, C. Y., & Praetorius, A. K. (2018). Studying mathematics instruction through different lenses: Setting the ground for understanding instructional quality more comprehensively. ZDM Mathematics Education, 50(3), 355–366. https://doi.org/10.1007/s11858-0180914-8 Christoforidou, M., & Xirafidou, E. (2014). Using the dynamic model to identify stages of teacher skills in assessment. Journal of Classroom Interaction, 49(1), 12–25. Coleman, J. S., Campbell, E. Q., Hobson, C. F., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. L. (1966). Equality of educational opportunity. US Government Printing Office. Collins, J. (2009). Social reproduction in classrooms and schools. Annual Review of Anthropology, 38, 33–48. https://doi.org/10.1146/annurev.anthro.37.081407.085242. Creemers, B. P. M., & Kyriakides, L. (2008). The dynamics of educational effectiveness: A contribution to policy, practice and theory in contemporary schools. Routledge. Creemers, B. P. M., & Kyriakides, L. (2010). School factors explaining achievement on cognitive and affective outcomes: Establishing a dynamic model of educational effectiveness. Scandinavian Journal of Educational Research, 54(1), 263–294. Creemers, B. P. M., Kyriakides, L., & Sammons, P. (2010). Methodological advances in educational effectiveness research. Routledge. Döhrmann, M., Kaiser, G., & Blömeke, S. (2012). The conceptualisation of mathematics competencies in the international teacher education study TEDS-M. 
ZDM, 44(3), 325–340. Elberts, R. W., & Stone, J. A. (1988). Student achievement in public schools: Do principles make a difference? Economics of Education Review, 7(3), 291–299. Gustafsson, J. E. (2013). Causal inference in educational effectiveness research: A comparison of three methods to investigate effects of homework on student achievement. School Effectiveness and School Improvement, 24(3), 275–295. Hamre, B. K., Pianta, R. C., Burchinal, M., Field, S., LoCasale-Crouch, J., et al. (2012). A course on effective teacher-child interactions: Effects on teacher beliefs, knowledge, and observed practice. American Educational Research Journal, 49(1), 88–123.


Harker, R., & Tymms, P. (2004). The effects of student composition on school outcomes. School Effectiveness and School Improvement, 15, 177–199. Hiebert, J., Gallimore, R., Garnier, H., Givvin, K. B., Hollingsworth, H., Jacobs, J., & Stigler, J. (2003). Teaching mathematics in seven countries: Results from the TIMSS 1999 video study. Education Statistics Quarterly, 5(1), 7–15. Hill, H. C., Ball, D. L., & Schilling, S. G. (2008). Unpacking pedagogical content knowledge: Conceptualizing and measuring teachers’ topic-specific knowledge of students. Journal for Research in Mathematics Education, 39(4), 372–400. Hox, J. J., & Roberts, J. K. (Eds.). (2011). Handbook of advanced multilevel analysis. Routledge. Husén, T. (Ed.). (1967). International study of achievement in mathematics: A comparison of twelve countries (Vol. 1–2). Almqvist & Wiksell. Jencks, C., Smith, M., Acland, H., Bane, M. J., Cohen, D., Gintis, H., Heyns, B., & Michelson, S. (1972). Inequality: A reassessment of the effects of family and schooling in America. Basic Books. Joyce, B. R., Weil, M., & Calhoun, E. (2000). Models of teaching. Allyn & Bacon. Kyriakides, L., Anthimou, M., & Panayiotou, A. (2020). Searching for the impact of teacher behavior on promoting students’ cognitive and metacognitive skills. Studies in Educational Evaluation, 64. https://doi.org/10.1016/j.stueduc.2019.100810 Kyriakides, L., & Charalambous, C. Y. (2014). Educational effectiveness research and international comparative studies: Looking back and looking forward. In R. Strietholt, W. Bos, J.-E. Gustafsson, & M. Rosén (Eds.), Educational policy evaluation through international comparative assessments (pp. 33–50). Waxmann. Kyriakides, L., Christoforou, C., & Charalambous, C. Y. (2013). What matters for student learning outcomes: A meta-analysis of studies exploring factors of effective teaching. Teaching and Teacher Education, 36, 143–152. Kyriakides, L., Creemers, B., Antoniou, P., & Demetriou, D. (2010). A synthesis of studies searching for school factors: Implications for theory and research. British Educational Research Journal, 36(5), 807–830. Kyriakides, L., Creemers, B. P. M., Antoniou, P., Demetriou, D., & Charalambous, C. (2015). The impact of school policy and stakeholders’ actions on student learning: A longitudinal study. Learning and Instruction, 36, 113–124. Kyriakides, L., Creemers, B. P. M., & Charalambous, E. (2018). Equity and quality dimensions in educational effectiveness. Springer. Kyriakides, L., Creemers, B. P. M., & Charalambous, E. (2019). Searching for differential teacher and school effectiveness in terms of student socioeconomic status and gender: Implications for promoting equity. School Effectiveness and School Improvement, 30(3), 286–308. Kyriakides, L., Creemers, B. P. M., Panayiotou, A., Vanlaar, G., Pfeifer, M., Gašper, C., & McMahon, L. (2014). Using student ratings to measure quality of teaching in six European countries. European Journal of Teacher Education, 37(2), 125–143. Kyriakides, L., Georgiou, M. P., Creemers, B. P. M., Panayiotou, A., & Reynolds, D. (2018). The impact of national educational policies on student achievement: A European study. School Effectiveness and School Improvement, 29(2), 171–203. Le Donné, N., Fraser, P., & Bousquet, G. (2016). Teaching strategies for instructional quality: Insights from the TALIS-PISA link data (OECD education working paper no. 148). OECD Directorate for Education and Skills. https://doi.org/10.1787/5jln1hlsr0lr-en Lelei, H. (2019). 
A case study of policy and actions of Rivers state, Nigeria to improve teaching quality and the school learning environment (Unpublished doctoral dissertation). School of Education. Levin, B. (2005). Governing education. University of Toronto Press. Lingard, B., & Grek, S. (2008). The OECD, indicators and PISA: An exploration of events and theoretical perspectives (ESRC/ESF research project on fabricating quality in education. Working paper 2). Madaus, G. G., Kellagham, T., Rakow, E. A., & King, D. (1979). The sensitivity of measures of school effectiveness. Harvard Educational Review, 49(2), 207–230.


Marsh, H. (2008). Big-fish-little-pond-effect: Total long-term negative effects of school-average ability on diverse educational outcomes over 8 adolescent/early adult years. International Journal of Psychology, 43(3–4), 53–54. Marshall, B., & Drummond, M. J. (2006). How teachers engage with assessment for learning: Lessons from the classroom. Research Papers in Education, 21(2), 133–149. Maslowski, R., Scheerens, J., & Luyten, H. (2007). The effect of school autonomy and school internal decentralization on students’ reading literacy. School Effectiveness and School Improvement, 18(3), 303–334. Monk, D. H. (1992). Education productivity research: An update and assessment of its role in education finance reform. Educational Evaluation and Policy Analysis, 14(4), 307–332. Muijs, R. D., Kyriakides, L., van der Werf, G., Creemers, B. P. M., Timperley, H., & Earl, L. (2014). State of the art-teacher effectiveness and professional learning. School Effectiveness and School Improvement, 25(2), 231–256. Musthafa, H. S. (2020). A longitudinal study on the impact of instructional quality on student learning in primary schools of Maldives (Unpublished doctoral dissertation). University of Cyprus, Nicosia, Cyprus. OECD. (2009). PISA 2006 data analysis manual. OECD Publications. OECD. (2016). PISA 2015 assessment and analytical framework: Science, reading, mathematic and financial literacy. PISA, OECD Publishing. OECD. (2017). PISA 2015 Assessment and analytical framework: Science, reading, mathematic, financial literacy and collaborative problem solving. OECD Publishing. https://doi.org/10. 1787/9789264281820-en. Olmsted, P. P., & Weikart, D. P. (Eds.). (1995). The IEA preprimary study: Early childhood care and education in 11 countries. Elsevier Science. Opdenakker, M. C., & Van Damme, J. (2006). Differences between secondary schools: A study about school context, group composition, school practice, and school effects with special attention to public and Catholic schools and types of schools. School Effectiveness and School Improvement, 17(1), 87–117. Paget, C. (2018). Exploring school resource and teacher qualification policies, their implementation and effects on schools and students’ educational outcomes in Brazil (Unpublished doctoral dissertation). University of Oxford, UK. Panayiotou, A., Kyriakides, L., Creemers, B. P. M., McMahon, L., Vanlaar, G., Pfeifer, M., Rekalidou, G., & Bren, M. (2014). Teacher behavior and student outcomes: Results of a European study. Educational Assessment, Evaluation and Accountability, 26, 73–93. Pianta, R., & Hamre, B. K. (2009). Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38(2), 109–119. Postlethwaite, N. (1967). School organization and student achievement: A study based on achievement in mathematics in twelve countries. Almqvist & Wiksell. Purves, A. C. (1987). The evolution of the IEA: A memoir. Comparative Education Review, 31(1), 10–28. Reynolds, D. (2006). World class schools: Some methodological and substantive findings and implications of the international school effectiveness research project (ISERP). Educational Research and Evaluation, 12(6), 535–560. Reynolds, D., Sammons, P., De Fraine, B., Van Damme, J., Townsend, T., Teddlie, C., & Stringfield, S. (2014). Educational effectiveness research (EER): A state-of-the-art review. School Effectiveness and School Improvement, 25(2), 197–230. Rowan, B., & Correnti, R. (2009). 
Studying reading instruction with teacher logs: Lessons from the study of instructional improvement. Educational Researcher, 38(2), 120–131. Rutter, M., Maughan, B., Mortimore, P., Ouston, J., & Smith, A. (1979). Fifteen thousand hours: Secondary schools and their effects on children. Harvard University Press. Sammons, P. (2009). The dynamics of educational effectiveness: A contribution to policy, practice and theory in contemporary schools. School Effectiveness and School Improvement, 20(s1), 123–129.


Scheerens, J. (2013). School leadership effects revisited: Review and meta-analysis of empirical studies. Springer. Scheerens, J. (2016). Educational effectiveness and ineffectiveness: A critical review of the knowledge base. Springer. Schmidt, W., & Valverde, G. A. (1995). National policy and cross national research: United States participation in the third international and science study. Michigan State University. Schmidt, W. H., Jorde, D., Cogan, L. S., Barrier, E., Gonzalo, I., Moser, U., et al. (1996). Characterizing pedagogical flow. Kluwer Academic Publishers. Schulz, W., Carstens, R., Losito, B., & Fraillon, J. (2018). ICCS 2016 technical report – IEA international civic and citizenship education study 2016. International Association for the Evaluation of Educational Achievement (IEA). Retrieved from https://www.iea.nl/sites/ default/files/2019-07/ICCS%202016_Technical%20Report_FINAL.pdf Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis research. Review of Educational Research, 77, 454–499. Shechtman, N., Roschelle, J., Haertel, G., & Knudsen, J. (2010). Investigating links from teacher knowledge, to classroom practice, to student learning in the instructional system of the middleschool mathematics classroom. Cognition and Instruction, 28(3), 317–359. Stigler, J., & Hiebert, J. (1999). The teaching gap. The Free Press. Tatto, M. T., Schwille, J., Senk, S. L., Ingvarson, L., Rowley, G., Peck, R., et al. (2012). Policy, practice, and readiness to teach primary and secondary mathematics in 17 countries: Findings from the IEA teacher education and development study in mathematics (TEDS-M). International Association for Educational Achievement (IEA). van der Werf, G., Opdenakker, M.-C., & Kuyper, H. (2008). Testing a dynamic model of student and school effectiveness with a multivariate multilevel latent growth curve approach. School Effectiveness and School Improvement, 19(4), 447–462.

Part IV Meta-perspectives on ILSAs: Characteristics of ILSAs

13 Overview of ILSAs and Aspects of Data Reuse

Nathalie Mertes

Contents
Introduction
Agencies, Studies and Cycles, and Participating Entities
ILSA Objectives
Domains of Investigation
Target Populations and Samples
Data Collection
ILSA International Databases and Aspects Related to Data Analysis
Data Reuse (Secondary Analyses)
Conclusion
References

Abstract

The landscape of international large-scale assessments (ILSAs) in education has constantly grown since the implementation of the first study in the early 1960s. This chapter provides an overview of recent school-related ILSAs, i.e., ICCS, ICILS, PASEC, PIRLS, PISA, SACMEQ, TALIS, TERCE, TIMSS, and TIMSS Advanced, including a description of common features and major differences, and addresses the reuse of ILSA data in order to conduct secondary analyses. Topics include objectives of ILSAs, domains of investigation, study sizes in terms of participating entities (countries, economies, regions), target populations and samples, techniques and levels of data collection, the construction and contents of the international databases, and methods of data analysis, with a focus on those relevant for data reuse. Time issues are also addressed, including first implementation, frequency of data collection, and number of cycles.

N. Mertes (*)
RandA, IEA, Hamburg, Germany
e-mail: [email protected]


The chapter suggests various ways of classifying ILSAs, using different tables to facilitate fast insights. The initiators responsible for the development and conduct of these ILSAs, i.e., the CONFEMEN, IEA, OECD, SEACMEQ, and UNESCO, are briefly described, as are the measures they take to support secondary analyses. In this context, the first centralized platform, the ILSA Gateway, is presented as a source that facilitates access to information and materials from all ILSAs and has the potential to encourage and inspire knowledge exchange, discussions, and future research. The chapter concludes with a brief outline of some remaining gaps and challenges.

Keywords

Large-scale assessment · ILSA · Education research · Data reuse · Secondary analysis

Introduction

This handbook focuses on ILSAs investigating primary and secondary rather than early childhood and pre-primary education or lifelong learning. Therefore, the present chapter presents school-related ILSAs exclusively, i.e., studies with target populations in schools. As there are already comprehensive overviews available for the previous periods, from 1964 to 2011 (Chromy, 2002, expanded by Heyneman & Lee, 2014), this chapter concentrates on studies and cycles conducted in the last decade, since 2010. It provides an overview of studies and covers aspects related to the reuse of ILSA data. The overview starts with the identification of ILSA agencies, a concise historical outline of studies and cycles, and numbers of participating entities, followed by a description of general ILSA objectives. Then, domains of investigation as well as target populations and samples are compared. Types and levels of data collection, the construction and contents of the international databases, and methods of data analysis are described afterward. All these topics offer important insights for researchers wanting to reuse ILSA data. In addition, the chapter includes dedicated subsections describing measures taken by ILSA agencies to support secondary analyses. It concludes with some general suggestions for future research.

Agencies, Studies and Cycles, and Participating Entities

International large-scale assessments have become a regular feature of the education research landscape, but they are still a rather recent phenomenon. Their history starts in the first half of the 1960s, even if discussions about their need and feasibility began decades earlier (Heyneman & Lee, 2014). In 1958, the International Association for the Evaluation of Educational Achievement (IEA) was founded as the result of a meeting of scholars at the United Nations Educational, Scientific and Cultural Organization (UNESCO) Institute for Education in Hamburg, Germany, with the aim of conducting comparative studies in schools across countries worldwide (Wagemaker, 2014). After its Pilot Twelve-Country Study, implemented in 1960, the IEA started formal testing initiatives with the First International Mathematics Study (FIMS) in 1964 (IEA, n.d.-a). Since then, numerous ILSAs have been undertaken inside and outside the school context. Many of them have been conducted with participants from throughout the whole world; however, there are also transnational studies focusing on regions and/or languages. All of them will be taken into account in this chapter; Table 1 provides an overview.

As far as ILSA agencies are concerned, the IEA, an international nongovernmental and nonprofit research organization, has remained an important player in the field, but there are other organizations conducting ILSAs in the school context: CONFEMEN, the Conference of the Ministers of Education of French-speaking countries (in French: Conférence des Ministres de l’Education des Etats et Gouvernements de la Francophonie), has become one of them, as have the OECD, short for the Organisation for Economic Co-operation and Development, an intergovernmental organization; SEACMEQ (previously SACMEQ), the Southern and Eastern Africa Consortium for Monitoring Educational Quality, an international nonprofit developmental organization of 16 ministries of education; and UNESCO, which is part of the intergovernmental United Nations organization.

Beyond the initiating and supporting agency, many institutions and people contribute to the successful development and implementation of an ILSA, a combination of several of the following: a governing board, a steering committee, study directors, advisory boards or committees, NPMs (national project managers, e.g., in OECD studies) or NRCs (national research coordinators, e.g., in IEA studies), an international consortium (gathering various companies, organizations, and institutes with specific knowledge and skills), expert committees or groups, and external research consultants.

Depending on the background, culture, and design of a study, different terms are used for describing study participants: countries, economies, education or school systems, and benchmark participants, referring to regional jurisdictions of countries (e.g., cities, states, or provinces). Numbers of participating entities have increased considerably, from 12 in the IEA’s FIMS in 1964 up to more than 60 or around 80 in recent IEA or OECD studies, respectively. At present, about 70% of the countries in the world are estimated to participate in ILSAs (Lietz et al., 2017).

All these ILSAs adopted a cyclical, trend approach, according to which a study is repeated at more or less regular intervals. A whole study cycle takes several years to conduct, typically 3–6 years, from framework and instrument development through field testing, data collection, and data analysis to the release of results and databases. The year that is used to denominate a cycle usually refers to the main period of data collection. Numerous countries participate in various ILSAs, either in parallel or alternately, considering them complementary approaches to the investigation of their education systems (Wagemaker, 2014). Details related to time, participants, and other aspects of these recent studies are presented in the next paragraphs, in alphabetical order of ILSA acronyms.


Table 1 Overview of school-related ILSAs since 2010

Agency | Study acronym and long title | Cycles | Participating entities
CONFEMEN | PASEC: Programme d'analyse des systèmes éducatifs de la CONFEMEN | 2014 | 10 countries
IEA | ICCS: International Civic and Citizenship Education Study | 2016 | 24 education systems
IEA | ICILS: International Computer and Information Literacy Study | 2013 | 18 countries, 3 benchmarking participants
IEA | ICILS: International Computer and Information Literacy Study | 2018 | 12 countries, 2 benchmarking participants
IEA | PIRLS: Progress in International Reading Literacy Study | 2011 | 49 countries, 9 benchmarking participants
IEA | PIRLS: Progress in International Reading Literacy Study | 2016 | 50 countries, 11 benchmarking participants
IEA | TIMSS: Trends in International Mathematics and Science Study | 2011 | 63 countries, 14 benchmarking participants
IEA | TIMSS: Trends in International Mathematics and Science Study | 2015 | 57 countries, 7 benchmarking participants
IEA | TIMSS: Trends in International Mathematics and Science Study | 2019 | 64 countries, 8 benchmarking participants
IEA | TIMSS Advanced: Trends in International Mathematics and Science Study Advanced | 2015 | 9 countries
OECD | PISA: Program for International Student Assessment | 2012 | 65 countries and economies
OECD | PISA: Program for International Student Assessment | 2015 | 72 countries and economies
OECD | PISA: Program for International Student Assessment | 2018 | 79 countries and economies
OECD | TALIS: Teaching and Learning International Survey | 2013 (round I) | 38 countries and economies
OECD | TALIS: Teaching and Learning International Survey | 2014 (round II) | 4 countries and economies
OECD | TALIS: Teaching and Learning International Survey | 2018 | 48 countries and economies
SEACMEQ | SACMEQ: Southern and Eastern Africa Consortium for Monitoring Educational Quality Project | 2013 | 13 school systems
UNESCO | TERCE: Tercer Estudio Regional Comparativo y Explicativo | 2013 | 15 countries


ICCS, short for International Civic and Citizenship Education Study, is conducted by the IEA. In 2009, ICCS "was established as a baseline study for future assessments" (Schulz et al., 2016, p. 2), although civic education had been one of the domains addressed already in the early IEA Six Subjects Study (1970–1971) and had again been investigated in the second half of the 1990s in the CIVED study. In addition to the main assessment, ICCS offers three regional modules for Asia, Europe, and Latin America. In the 2016 cycle, 15 education systems participated in the regional European module, 5 systems in the regional Latin American module, and 24 systems participated in the international study (Schulz et al., 2017). The next cycle is ICCS 2022.

ICILS, the International Computer and Information Literacy Study, which is also conducted by the IEA, was implemented for the first time in 2013. Again, there had been precursor IEA studies, the Computers in Education Study (COMPED), conducted in 1989 and 1992, and the Second Information Technology in Education Study (SITES), conducted in 1998–1999 (Module 1), 2001 (Module 2), and 2006. In the first (2013) ICILS cycle, 18 countries and 3 benchmarking participants took part, and in the second (2018) cycle, 12 countries and 2 benchmarking participants (Fraillon et al., 2017). ICILS 2023 will be the third cycle.

PASEC, short for Programme d'analyse des systèmes éducatifs (in English: Programme for the Analysis of Education Systems), is conducted by the CONFEMEN in French-speaking African countries. It was first implemented in 2014 with ten participating countries. The study will take place every 5 years (Hounkpodote et al., 2017), and the second (2019) cycle is currently ongoing.

PIRLS, the Progress in International Reading Literacy Study, is another IEA study. Reading was one of the domains investigated in the Six Subjects Study, but in 2001, PIRLS was first implemented as a follow-up to the IEA's Reading Literacy Study (1991). PIRLS is conducted every 5 years and offers some optional versions, for example, PIRLS Literacy (previously prePIRLS), a less difficult version of PIRLS, and ePIRLS, an assessment of online reading. In PIRLS 2011, 49 countries and 9 benchmarking entities participated. Numbers of participants for PIRLS 2016 and its options are as follows: 50 countries and 11 benchmarking entities for PIRLS, 5 countries and 1 benchmarking entity for PIRLS Literacy, and 14 countries and 2 benchmarking entities for ePIRLS (Hooper et al., 2017b). Next is PIRLS 2021, the fifth cycle.

PISA, the Program for International Student Assessment, is an OECD study. It was first conducted in 2000 with a triennial frequency. As a result of the COVID-19 pandemic, the eighth cycle, PISA 2022, is taking place 4 years after the previous one. In addition to regular PISA, the OECD initiated PISA for Development (PISA-D) in 2014 as a "one-off pilot project, spanning six years, . . . to make the assessment more accessible and relevant to low-to-middle-income countries" (OECD, 2018a, p. 4). Regular PISA has seen continuous increases in participants, from 65 countries and economies in the 2012 cycle, to 72 countries and economies in the 2015 cycle, to 79 countries and economies in the 2018 cycle (Avvisati, 2017).

The SACMEQ Project is conducted by SEACMEQ (previously SACMEQ), both short for the Southern and Eastern Africa Consortium for Monitoring Educational Quality. The study was first implemented in 1997 and has been repeated at irregular intervals, with subsequent cycles conducted in 2000, 2007, and 2013, the latter in 13 school systems (Department of Basic Education, Rep. of South Africa, 2017). Information about the next cycle is not yet available.

TERCE, short for Tercer Estudio Regional Comparativo y Explicativo (in English: Third Regional Comparative and Explanatory Study), is a study conducted by the UNESCO in Latin America and the Caribbean. The first cycle was conducted in 1997, the second in 2006, and the third in 2013, the latter with a total of 15 countries (Viteri & Inostroza Fernández, 2017). With the ongoing (2019) cycle, the name changes to ERCE (Estudio Regional Comparativo y Explicativo, in English: Regional Comparative and Explanatory Study).

TALIS, the Teaching and Learning International Survey, is another OECD study. It was first implemented in 2008 with a frequency of 5–6 years. The second cycle was conducted in two phases, the first one in 2013 with 38 countries and economies and the second one in 2014 with 4 countries and economies. In 2018, 48 countries and economies participated (Tremblay et al., 2017). The next cycle is scheduled for 2024.

TIMSS, the Trends in International Mathematics and Science Study, conducted by the IEA, was first implemented in 1995, under the title of "Third International Mathematics and Science Study," after a series of separate math- and science-related IEA studies (in the early 1970s and 1980s). The title changed to "trends" with the 2003 cycle. TIMSS takes place every 4 years, with the next (2023) cycle being the eighth. TIMSS offers several optional versions: a less difficult version of the mathematics part at grade 4, introduced with the 2015 cycle under the name of "TIMSS Numeracy" and continued in 2019 under the name of "Less Difficult TIMSS," and, beginning with the 2019 cycle, eTIMSS, a computerized, interactive version of TIMSS. In 2011, 63 countries and 14 benchmarking entities took part in TIMSS. In 2015, overall, 57 countries and 7 benchmarking entities participated in TIMSS, and 7 countries and 1 benchmarking entity participated in TIMSS Numeracy (Hooper et al., 2017c). In the recently published TIMSS 2019 cycle, overall, 64 countries and 8 benchmarking entities participated, and 11 countries implemented the less difficult version in grade 4 (Mullis et al., 2020).

TIMSS Advanced is another IEA study, first implemented in 1995 with periodic assessments. The second cycle was conducted in 2008. In 2015, for the third cycle, TIMSS Advanced was conducted together with regular TIMSS; nine countries participated in TIMSS Advanced (Hooper et al., 2017a). Information about the next cycle is not yet available.

As this chapter presents school-related ILSAs exclusively, i.e., studies with target populations in primary and secondary schools, it does not address ILSAs dealing with early-childhood education, for example, PRIDI, short for Programa Regional de Indicadores de Desarrollo Infantil (in English: Regional Project on Child Development Indicators), a study undertaken by the Inter-American Development Bank (IDB), the Starting Strong Teaching and Learning International Survey (TALIS Starting Strong), and the International Early Learning and Child Well-being Study (IELS), both OECD studies, or ILSAs dealing with lifelong learning and noninstitutionalized target populations, for example, PIAAC, short for the OECD Programme for the International Assessment of Adult Competencies.


ILSA Objectives

Educational ILSAs share some general objectives. Since the first study, they have sought to contribute to the improvement of learning and teaching around the world, both in terms of quality and in terms of equity, even if no common understanding of these concepts has been established yet (Lietz et al., 2017). ILSAs are designed as representative studies at the system (not the individual or institutional) level and help participating countries or regions to gain a better, more comprehensive understanding of their education efforts (Rutkowski et al., 2014). In line with the growing demand for educational accountability, for monitoring (Rutkowski et al., 2014), and for evidence-based policy (Lietz et al., 2017), ILSAs provide high-quality data and indicators. They support national policy dialogues, development, and decisions, they help to prepare and evaluate educational reforms, and they inform education practices (Wagemaker, 2014).

ILSAs investigate preparedness for study, life, or work, usually in terms of knowledge and skills in selected areas, measured through achievement tests or surveys. For example, ICCS "investigate[s] the ways in which young people are prepared to undertake their roles as citizens" (Schulz et al., 2016, p. 1), and PISA "assesses the extent to which 15 year old students near the end of their compulsory education have acquired the knowledge and skills that are essential for full participation in modern societies" (OECD, 2019a, p. 3). In addition, these studies investigate relations (in terms of correlations, not cause-and-effect relationships) between assessment or survey outcomes and contextual factors.

Supporting the monitoring of trends at the system level is another major common purpose of ILSAs. They create and distribute evidence about a population and area(s) of interest at a specific point in time. With the vast majority of these studies being conducted at regular intervals, they allow repeatedly participating countries to track changes in their own results across comparable cohorts over time.

ILSAs also provide data which are comparable across countries. In contrast to the news headlines that tend to highlight rankings when the results of a new study cycle are released, ILSAs strive to encourage participating countries to learn from each other by identifying both strengths and opportunities for improvement. Countries should learn from other countries with similar challenges about the impacts of differing policy approaches and from higher-performing countries about possible achievements within specific contexts (Lietz et al., 2017).

Even if it is not stated as an explicit goal in the various studies, ILSAs have made important contributions to education research. They have supported the development, testing, and refinement of educational models and theories, of large-scale research methodologies (adapting and extending the methodologies developed for the early National Assessment of Educational Progress, NAEP, a long-term trend assessment in the USA), and of accompanying technologies. As far as study participants are concerned, ILSAs have inspired and supported the conduct of national and regional studies with capacity building initiatives (i.e., materials, training, and consulting) for all research phases, from framework development to report writing. Finally, ILSAs have helped to set up and strengthen a worldwide dedicated research community (Wagemaker, 2014).


Within the scope of these general purposes and, possibly, beyond them, several specific objectives are formulated for each ILSA. These study-related objectives, for some studies broken down into explicitly formulated research questions, shape the decisions that need to be taken along the research process, from the identification of the domains of investigation and the development of a theoretical and/or conceptual framework, through the elaboration and implementation of the research instruments, data analysis, and report writing, to the dissemination of results.

Domains of Investigation

International large-scale studies in education can include one or more domain(s) of investigation. However, a first distinction needs to be made between assessment domains and survey domains as the primary focus of an ILSA. An assessment typically refers to "a test (paper, electronic, or online format) for gaining information about an individual's knowledge, skills, or understandings in a subject area or domain of interest" (Glossary, 2017). All studies reported here include an assessment, except TALIS, which uses a survey design, administering self-report questionnaires. TALIS investigates "teachers and teaching" and the 2018 cycle focused on two main domains, the "teaching profession (professional characteristics)" and "teaching and learning (pedagogical practices)," including the following themes: the institutional environment (e.g., "human resource issues and stakeholder relations" and "school leadership"), teacher characteristics (e.g., "teacher education and initial preparation" and "teacher feedback and development"), teacher (instructional and professional) practices, and the intersection of various themes at both the institutional and teacher level (including "innovation" as well as "equity and diversity") (Ainley & Carstens, 2018).

A second distinction needs to be made between assessment or survey domains and contextual or background domains, which are established in order to identify factors correlating with assessment or survey outcomes. Table 2 provides an overview of assessment domains for school-related ILSAs since 2010 (IEA, 2017).

Table 2 Assessment domains of school-related ILSAs since 2010

Assessment domains | Studies
Civics and citizenship | ICCS 2016
Computer literacy/information literacy/computational thinking (*) | ICILS 2013, 2018 (*)
Financial literacy | PISA 2012, 2015, 2018
Global competence | PISA 2018
HIV/AIDS knowledge | SACMEQ IV 2013
Mathematical literacy/mathematics/numeracy | PASEC 2014; PISA 2012 (major domain), 2015, 2018; SACMEQ IV 2013; TERCE 2013; TIMSS 2011, 2015, 2019; TIMSS Advanced 2015
Problem-solving (**)/collaborative problem-solving (***) | PISA 2012 (**), 2015 (***)
Reading/reading literacy | PASEC 2014; PIRLS 2011, 2016; PISA 2012, 2015, 2018 (major domain); SACMEQ IV 2013; TERCE 2013
Science/scientific literacy | PISA 2012, 2015 (major domain), 2018; TERCE 2013; TIMSS 2011, 2015, 2019; TIMSS Advanced 2015
Tuberculosis (TB) knowledge | SACMEQ IV 2013
Writing | TERCE 2013

Mathematics, science, and reading have been the prevailing assessment domains. Terminology and concepts may vary from study to study, but aspects related to mathematical literacy, mathematics, or numeracy were studied in PASEC 2014; PISA 2012 (as the major domain), 2015, and 2018; SACMEQ IV 2013; TERCE 2013; TIMSS 2011, 2015, and 2019; and TIMSS Advanced 2015. Science or scientific literacy was addressed in PISA 2012, 2015 (as the major domain), and 2018; TERCE 2013; and TIMSS 2011, 2015, and 2019. TIMSS Advanced 2015 concentrated on one single science domain: physics. Reading or reading literacy was addressed in PASEC 2014; PIRLS 2011 and 2016; PISA 2012, 2015, and 2018 (as the major domain); SACMEQ IV 2013; and TERCE 2013, with the latter also covering aspects related to writing. The following definitions indicate how understandings of an assessment domain may differ from one study to another:

• PIRLS 2016 referred to reading literacy as "the ability to understand and use those written language forms required by society and/or valued by the individual. Readers can construct meaning from texts in a variety of forms. They read to learn, to participate in communities of readers in school and everyday life, and for enjoyment" (Hooper et al., 2017b).
• PISA 2018 defined it as "understanding, using, evaluating, reflecting on and engaging with texts in order to achieve one's goals, to develop one's knowledge and potential and to participate in society" (OECD, 2019a, p. 28).

Within the same ILSA, concepts may and do also evolve from one cycle to another in order to implement lessons learnt from previous cycles, to include new research findings, and to adapt to recent (educational or technological) developments. However, with one of their major purposes being the monitoring of trends, ILSAs seek to maintain conceptual continuity over time.

ILSAs have also investigated other assessment domains. Civics and citizenship was addressed in ICCS 2016. Computer and information literacy was investigated in ICILS 2013 and 2018, with the latter also including computational thinking (CT) for the first
time and as an optional domain. CT refers to “an individual’s ability to recognize aspects of real-world problems which are appropriate for computational formulation and to evaluate and develop algorithmic solutions to those problems so that the solutions could be operationalized with a computer” (Fraillon et al., 2019, p. 3). Financial literacy was an optional assessment domain in all PISA cycles from 2012 to 2018. Problem-solving and collaborative problem-solving were part of minor assessment domains in the 2012 and 2015 PISA cycles, respectively, while PISA 2018 saw the introduction of the new global competence domain, instead, which was defined as follows: “Globally competent individuals can examine local, global and intercultural issues, understand and appreciate different perspectives and worldviews, interact successfully and respectfully with others, and take responsible action toward sustainability and collective well-being” (OECD, 2019a, p. 166). The assessment domains of HIV/AIDS knowledge and tuberculosis (TB) knowledge (the latter as a national option) were investigated in SACMEQ IV 2013. While all these studies were undertaken in the school context, they took different approaches. PISA focused on knowledge and skills, investigating “young people’s readiness for life beyond compulsory schooling and their ability to use their knowledge and skills to meet real-life challenges” (OECD, 2019a, p. 46). The other studies, those conducted by the IEA as well as PASEC, SACMEQ, and TERCE, opted for a curriculum approach, based “on the notion of ‘opportunity to learn’ in order to understand the linkages between the intended curriculum (what policy requires), the implemented curriculum (what is taught in schools) and the achieved curriculum (what students learn)” (IEA, n.d.-a). ILSAs also collect background or contextual data for three main purposes: first, to identify factors that correlate with differences in assessment or survey outcomes; second, “to measure affective or behavioural variables which are regarded as learning outcomes beyond cognitive performance in their own right”; and third, “to generate indicators of education systems other than students’ cognitive and attitudinal learning outcomes” (Lietz, 2017, pp. 93–94). Contextual domains of investigation may include several of the following: student characteristics, perceptions, attitudes, or activities; home and family backgrounds; aspects related to teachers and their teaching; classroom and school environments; wider community contexts; and characteristics of the (national) education system (IEA, 2017). For each ILSA, primary (assessment or survey) domains and secondary (background or contextual) domains are identified and described in a framework document. In addition to encompassing definitions of the various domains of investigation, framework documents typically include a presentation of the theoretical and/or conceptual foundations of the study, based on a discussion of the relevant research literature, a thorough description of the constructs that will be addressed, and detailed information about study instruments and item types. The development of a framework is a collaborative and iterative process at international level, led by a group of academic (subject) experts, engaging other qualified people, for example, experts with pedagogical knowledge in the respective domain or with experience in test development, as well as national representatives (Mendelovits, 2017).


Target Populations and Samples

For describing the groups of individuals from which data are collected in order to study the various domains of investigation, a distinction needs to be made between the primary target population, the secondary target population, and (other) study participants. The primary ILSA target population refers to the first unit of analysis and the target of the assessment or the main survey. The secondary target population describes a second unit of analysis in its own right and for which results are reported separately. (Other) study participants are ILSA participants "linked" to the primary population and from whom background information is collected in order to gain a better understanding of the phenomenon of interest.

Students were the predominant primary target population in the past decade (Table 3), with all ILSAs, except TALIS, focusing on them. However, studies used different approaches for defining student target populations, either a grade-based or an age-based approach or a combination of these.

Table 3 Primary target populations in school-related ILSAs since 2010

Target populations | Details | Studies
Students | Grade 2 | PASEC 2014
Students | Grade 3 | TERCE 2013
Students | Grade 4 | PIRLS 2011, 2016; TIMSS 2011, 2015, 2019
Students | Last grade (5 or 6) of primary school | PASEC 2014
Students | Grade 6 | SACMEQ IV 2013; TERCE 2013
Students | 15-year-old students enrolled in an educational institution at grade 7 or higher | PISA 2012, 2015, 2018
Students | Grade 8 | ICCS 2016; ICILS 2013, 2018; TIMSS 2011, 2015, 2019
Students | Students in final year of secondary schooling (grade 11–13) and enrolled in advanced mathematics or physics courses, respectively | TIMSS Advanced 2015
Teachers | Lower secondary school teachers | TALIS 2013, 2018


PISA 2012, 2015, and 2018 used an age- or birth-based approach, assessing students who were 15 years old and enrolled in a school at grade 7 or higher. An important argument brought forward for this decision is that "these students are approaching the end of compulsory schooling in most participating countries, and school enrolment at this level is close to universal in almost all OECD countries" (OECD, 2017, p. 1). This is in line with the future-oriented approach of PISA, assessing students' ability to apply in real-life situations the knowledge and skills developed at school.

Other studies, those conducted by the IEA as well as PASEC, SACMEQ, and TERCE, used a grade-based approach, "assess[ing] students' learning outcomes after a fixed period of schooling, and [being] fundamentally concerned with students' opportunity to learn and learning outcomes" (Wagemaker, 2014, p. 14). PIRLS 2011 and 2016 assessed students in grades representing 4 years of formal schooling in the participating entities, counting from the first year of ISCED (International Standard Classification of Education) Level 1 (primary education). This referred to grade 4 in most countries and is "an important transition point in children's development as readers, because at this stage most students should have learned to read, and are now reading to learn" (Mullis et al., 2012, p. 25). Also, in some countries, for example, in Germany, grade 4 marks the end of the first (primary) school cycle. However, if the average age of students in a participating entity was less than 9.5 years, students in the next higher grade, typically grade 5, were defined as the target population.

ICCS 2016, ICILS 2013 and 2018, and TIMSS 2011, 2015, and 2019 assessed students in their eighth year of schooling, counting from the first year of ISCED Level 1. However, in all these studies, grade 9 students were assessed instead of grade 8 in participating entities where the average age of students in grade 8 was less than 13.5 years. TIMSS 2011, 2015, and 2019 had two main target populations, not only students with eight years of schooling but also those with four (typically, grade 4), providing participating entities with the option to investigate both populations or only one of them. For education systems participating every 4 years and assessing the grade 4 student cohort 4 years later at grade 8, TIMSS takes a quasi-longitudinal design, offering an opportunity to identify necessary curricular reforms (at grade 4) and to monitor their effectiveness 4 years afterward (at grade 8) (IEA, n.d.-a). The 2015 and 2019 TIMSS cycles also offered the opportunity to implement a less difficult version, TIMSS Numeracy or Less Difficult TIMSS, respectively, at the end of primary school. TIMSS Advanced 2015, on the other hand, assessed students enrolled in advanced mathematics or physics courses, respectively, in the final year of secondary school, which varied from grade 11 to 13 in participating countries (Martin et al., 2016).

PASEC 2014 investigated students formally enrolled in grade 2, i.e., at the beginning of primary school, and in grade 5 or 6, i.e., at the end of primary school, allowing a "better analysis and understanding of the effectiveness and equity of education systems" (PASEC, 2015, p. VII). TERCE 2013 also investigated two populations in participating countries: students at grades 3 and 6 in elementary schools. Learning in mathematics, reading, and writing was assessed in both grades, learning in the natural sciences in grade 6 only (LLECE, 2015). The target population of SACMEQ IV 2013 was defined as "all learners at Standard 6 level in 2013 (at the first week of the eighth month of the school year) who were attending registered mainstream (primary) schools" (Chabaditsile et al., 2018, p. 6).

The only ILSA undertaken in the last decade with teachers as the primary target population was TALIS (2013, 2018).
In its core study, TALIS surveyed teachers at ISCED Level 2 (lower secondary education) and offered various options focusing on other teacher target populations: teachers at primary school (ISCED level 1), at upper secondary school (ISCED level 3), and at schools sampled for PISA 2012 or 2018, respectively, an option referred to as "TALIS-PISA Link" (OECD, 2019b). For the first time in TALIS, the most recent (2018) cycle investigated principals as a separate, secondary target population: "the programme's sampling coverage extend[ed] to all teachers of an ISCED level and to the principals of the schools in which these teachers [were] working" (OECD, 2019b, p. 97). Some other studies, ICCS 2016, both (2013 and 2018) ICILS cycles, and PISA, as an optional feature in 2015 and 2018, investigated teachers as a secondary unit of analysis. For example, in ICCS 2016 the teacher target population included "all teachers teaching regular school subjects to students of the target grade (regardless of the subject or the number of hours taught) during the ICCS testing period who [had] been employed at school since the beginning of the school year" (Schulz et al., 2017). PISA defined this population as "ALL (sic) teachers that were eligible for teaching the modal grade – whether they were doing so currently, had done so before, or will/could do so in the future" (OECD, 2017, p. 86).

All school-related ILSAs, except TALIS, also gathered information from other study participants, background information that helped to gain a deeper understanding of the phenomenon of interest, i.e., student learning in a specific domain. These other study participants comprised, basically, the teachers of the assessed students, for example, in TIMSS 2015 Advanced "the teachers of the advanced mathematics and/or physics classes sampled to take part in the TIMSS Advanced testing" (Mullis & Martin, 2014, p. 52), and/or the principals of the sampled schools, other relevant experts inside school, for example, information and communications technology (ICT) coordinators (in ICILS 2018), student parents, NRCs, and (other) education experts in the country (IEA, 2017).

In these large-scale international assessments, it is neither intended nor possible to assess or survey all individuals who are part of a target population. Instead, a sample is selected in order to make inferences about the whole population. ILSAs typically use a two-stage sample design for identifying national samples, starting with a selection of schools, followed by a selection of students (or classrooms) and/or teachers within these schools. In order to achieve high-quality samples, the two-stage strategy is typically combined with other sampling techniques, for example, (explicit or implicit) stratification of the school sample, selection of the school sample using probability proportional to size (PPS) sampling, selection of student samples within schools with equal probability and equal-size sampling, and identification of school and student samples using systematic sampling. Widely accepted norms for ILSA sample sizes are at least 150 schools and 4000 students per country, although many countries choose to use larger samples (Rust, 2014). For example, PIRLS 2016 used a stratified two-stage cluster sample design.
At the first stage, schools were sampled employing probability proportional to size (PPS) and random-start fixed-interval systematic sampling, with the option of stratifying schools (explicitly or implicitly) according to important (demographic) variables (e.g., the region of the country, the school type or source of funding, or
languages of instruction). At the second stage, classes within schools were sampled, i.e., one or more intact classes from the target grade of each school were selected using systematic random sampling (Hooper et al., 2017b). For the majority of countries, precision requirements were met with a school sample of 150 schools and a student sample of 4000 students (i.e., often one classroom per school) for each target grade. Overall, about 319,000 students, 310,000 parents, 16,000 teachers, and 12,000 schools participated in PIRLS 2016 (Mullis et al., 2017).
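
To make the two-stage logic more concrete, the following sketch illustrates, under simplifying assumptions, how a school sample might be drawn with probability proportional to size using random-start fixed-interval systematic sampling, followed by the selection of one intact class per sampled school. The school frame, enrolment figures, and class labels are invented for illustration and do not reflect any actual ILSA sampling frame; operational sampling follows each study's technical documentation.

```python
import random

def pps_systematic_sample(schools, n_schools, seed=1):
    """First stage: select schools with probability proportional to size (PPS)
    using random-start fixed-interval systematic sampling.
    `schools` is a list of dicts holding a measure of size (here: enrolment)."""
    rng = random.Random(seed)
    total = sum(s["enrolment"] for s in schools)
    interval = total / n_schools          # fixed sampling interval
    start = rng.uniform(0, interval)      # random start within the first interval
    ticks = [start + i * interval for i in range(n_schools)]
    sample, cumulative, t = [], 0.0, 0
    for school in schools:                # frame assumed sorted by strata beforehand
        cumulative += school["enrolment"]
        while t < n_schools and ticks[t] <= cumulative:
            sample.append(school)         # very large schools can be hit more than once
            t += 1
    return sample

def sample_one_class(school, rng):
    """Second stage: draw one intact class from a sampled school."""
    return rng.choice(school["classes"])

# Tiny synthetic frame; names, enrolments, and class labels are invented.
frame = [
    {"name": f"School {i}", "enrolment": size, "classes": [f"S{i}-A", f"S{i}-B"]}
    for i, size in enumerate([120, 80, 300, 150, 60, 220, 90, 180], start=1)
]
class_rng = random.Random(2)
for school in pps_systematic_sample(frame, n_schools=3):
    print(school["name"], "->", sample_one_class(school, class_rng))
```

Because selection probabilities are proportional to enrolment, larger schools are more likely to enter the sample, an imbalance that the sampling weights calculated later compensate for.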

Data Collection

ILSAs typically employ predominantly quantitative research designs, using two main approaches of data collection: proctored assessments of achievement and self-administered surveys. All studies reported here, except TALIS, administered an achievement test in order to assess student outcomes in their specific domain(s) of investigation. Because they are intended to measure trends while taking into account recent developments in research and technology, ILSA assessments comprise a combination of existing and newly developed items. PIRLS, for example, relies on the following strategy: "The design for passage/item replacement provides for each assessment to include passages and items from three cycles—essentially, one-third newly developed, one-third from the previous cycle, and one-third from two cycles before" (Martin et al., 2016, p. 1.3).

As they are investigations at the system (not individual) level, and in order to assess broad sets of constructs while limiting testing time for each student, ILSA assessments typically use an incomplete, rotated (booklet) design, with each student working only on a limited number of items, based on matrix sampling, i.e., the "systematic or random selection of items from an overall pool of available items" (Glossary, 2017). For example, the regular PIRLS 2016 assessment included 175 items, combined in 12 item blocks and presented in 16 test booklets. Each booklet consisted of two assessment blocks, with each block being used in more than one booklet. Each student was invited to complete one booklet and was given 40 minutes for each part of the test, with a 30-minute break between the two parts (Martin et al., 2017).
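
The following sketch illustrates the general idea of such a rotated booklet design under simplified assumptions: a small pool of item blocks is paired into booklets so that every block appears in more than one booklet, which is what later allows the blocks to be linked on a common scale, and booklets are rotated across the students in a class. The block and booklet counts are illustrative only and do not reproduce the actual PIRLS booklet layout.

```python
# A minimal sketch of matrix sampling with a rotated booklet design.
# Block and booklet counts are illustrative, not the official PIRLS layout.
blocks = [f"Block {chr(65 + i)}" for i in range(6)]            # item blocks A-F

# Pair blocks cyclically so that every block appears in exactly two booklets;
# the shared blocks are what later allow all booklets to be linked on one scale.
booklets = [(blocks[i], blocks[(i + 1) % len(blocks)]) for i in range(len(blocks))]

def assign_booklets(student_ids):
    """Rotate booklets through the class list: each student answers only the
    two blocks in his or her booklet, i.e., a subset of the full item pool."""
    return {sid: booklets[i % len(booklets)] for i, sid in enumerate(student_ids)}

for sid, booklet in assign_booklets([f"st{n:03d}" for n in range(1, 8)]).items():
    print(sid, booklet)
```

In an operational assessment, the pairing and the positions of blocks within booklets are balanced by design; the sketch shows only the rotation principle.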

Tests were administered either as paper-based assessments (e.g., ICCS 2016, PASEC 2014, SACMEQ IV 2013, TERCE 2013, TIMSS 2015, and TIMSS Advanced 2015), as paper-based assessments by default with computer-based and/or online versions as an option (e.g., PIRLS 2016 and TIMSS 2019), as computer-based versions by default with paper-based versions as an option (e.g., PISA 2018), or as computer-based versions only (ICILS 2013 and 2018). The (main) survey in TALIS 2018 was administered either online or on paper, or both.

Building on technological developments, the administration of computer-based or online instruments is going to intensify in upcoming cycles and will allow the implementation of new and, in part, more precise instruments. For example, while TIMSS 2019 marked the beginning of the transition to the computerized, interactive version called "eTIMSS," allowing participating entities to opt for the electronic or the paper version or both, the 2023 TIMSS cycle will be conducted as a digital assessment, offering a paper-based version only with trend items and solely for participating entities that are unable to implement the digital assessment. The ePIRLS assessment is going to become digitalPIRLS in the 2021 cycle, an optional digital online alternative to PIRLS, and ICCS 2022 will offer a computer-based assessment option (IEA, n.d.-a). PISA is most likely going to refine the computer-based multistage adaptive design, which was developed for the major reading domain in the 2018 cycle, and to expand it to other assessment domains. Also, the 2025 PISA cycle will include optional assessment tasks in an open-ended, digital learning environment under the title of "Learning in the Digital World" (OECD, 2018c).

All studies that included an assessment used surveys, predominantly self-administered questionnaires, to gather contextual or background data. Table 4 provides an overview for the most recent completed and published cycles since 2010. All studies implemented questionnaires for students, teachers, and the school principal, except PISA 2018, where the teacher questionnaire was optional. ICILS 2018 also included an ICT coordinator questionnaire. In SACMEQ IV 2013, completion of the student questionnaire required parent support, so it was taken home (Chabaditsile et al., 2018).

Table 4 Collection of contextual data in most recent cycles of school-related ILSAs

Study | Contextual or background surveys
ICCS 2016 | Student, teacher, and school (principal) questionnaires; national context questionnaire (staff in national research centers)
ICILS 2018 | Student, teacher, school principal, and ICT coordinator questionnaires; national context questionnaire (staff in national research centers)
PASEC 2014 | Pupil, teacher, and headmaster questionnaires
PIRLS 2016 | Student, teacher, and school (principal) questionnaires; curriculum questionnaire (staff in national research centers); Learning to Read Survey (home questionnaire); descriptive encyclopedia chapters
PISA 2018 | Student and school principal questionnaires; optional: ICT familiarity questionnaire (students), well-being questionnaire (students), educational career questionnaire (students), parent questionnaire, teacher questionnaire, financial literacy questionnaire (students)
SACMEQ IV 2013 | Pupil (and parent) questionnaire; school head and teacher questionnaires
TERCE 2013 | Student, teacher, principal, and parent questionnaires
TIMSS 2019 | Student, teacher, and school questionnaires; Early Learning Survey (home questionnaire, grade 4); curriculum questionnaire (staff in national research centers); descriptive encyclopedia chapters
TIMSS Advanced 2015 | Student, teacher, and school principal questionnaires; curriculum questionnaire (staff in national research centers); descriptive encyclopedia chapters (in TIMSS 2015 encyclopedia)

Parent or home questionnaires were used in PIRLS 2016 under the title of "Learning to Read Survey," in TERCE 2013, and in TIMSS 2019 in grade 4 only, labeled "Early Learning Survey." In PISA 2018, the parent questionnaire was one of the options, together with several optional student questionnaires, i.e., the "ICT familiarity questionnaire," the "well-being questionnaire," the "educational career questionnaire," and the "financial literacy questionnaire" (OECD, 2019a). Background questionnaires were administered on paper, as computer-based or online versions, or with the possibility for countries to combine these modes. The IEA studies also administered a "national context questionnaire" (ICCS 2016, ICILS 2018) or a "curriculum questionnaire" (PIRLS 2016, TIMSS 2019, and TIMSS Advanced 2015), both to be completed by staff in national research centers. In addition, PIRLS 2016, TIMSS 2019, and TIMSS Advanced 2015 asked each country to provide a qualitative, "descriptive encyclopedia chapter" with detailed information about its education system, focusing on the respective domain(s) of investigation; for TIMSS 2015 Advanced, country information was included in the TIMSS 2015 encyclopedia (IEA, 2017).

While the development of the assessments and surveys is a collaborative process at international level, responsibilities for the implementation of the instruments typically lie with each participating entity, i.e., the appointed national research centers and organizations. In order to ensure the collection of high-quality data that are internationally comparable, ILSAs develop standardized procedures for the various phases of the process, supported by numerous guidelines, recommendations, (tracking) forms, manuals, software, hands-on training, and support, complemented by encompassing quality assurance measures and controls (IEA, 2017; OECD, 2017).

ILSA International Databases and Aspects Related to Data Analysis

A major aim of each ILSA is to provide an international database "that allows for valid within-and-cross-country comparisons and inferences to be made" (OECD, 2017, p. 10). These databases also help to meet other requirements, such as transparency and replicability of results, public access to study data (a prerequisite for the funding ILSAs typically receive from public and government bodies or donors), and the facilitation of secondary, in-depth data analyses.

After data collection, work typically continues at the national level at first, with NRCs or NPMs, respectively, being responsible for preparing complete and accurate national data files or databases. Often together with external contractors, they execute, among other tasks, data checks for missing or additional records and inconsistencies, recoding of adapted variables, if applicable, coding and scoring of open-ended response questions, data entry into specialized software, and/or data extraction, if computer-based assessment (CBA) software was used. They also implement the respective verification and validation procedures on all entered data (Gebhardt & Berezner, 2017). Finally, national centers submit their data files or database to a central organization that processes the data from all participating entities.
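
As a rough illustration of the kind of plausibility checks national centers run before submission, the sketch below flags duplicate student records, category codes outside the codebook, and students whose school identifier has no match in the school file. The data, variable names, and codes are hypothetical and are not taken from any actual ILSA codebook.

```python
import pandas as pd

# Hypothetical national student and school files; variable names and codes
# are invented for illustration only.
students = pd.DataFrame({
    "student_id": ["S001", "S002", "S002", "S004"],
    "school_id":  ["SCH1", "SCH1", "SCH1", "SCH9"],
    "sex":        [1, 2, 2, 9],            # 9 = code outside the codebook
})
schools = pd.DataFrame({"school_id": ["SCH1", "SCH2"]})

# Duplicate student records.
duplicates = students[students.duplicated("student_id", keep=False)]
# Categorical codes outside the allowed range.
invalid_codes = students[~students["sex"].isin([1, 2])]
# Students whose school identifier has no match in the school file.
orphans = students[~students["school_id"].isin(schools["school_id"])]

for label, frame in [("duplicate records", duplicates),
                     ("invalid codes", invalid_codes),
                     ("unmatched schools", orphans)]:
    print(f"{label}: {len(frame)}")
```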


ILSA data sets are highly complex, a natural consequence of the complexity of the study designs, i.e., booklet rotation in the assessments, demanding multilevel sampling designs according to which data sources are partly interrelated, and the fact that data are collected not only from different sources but also at different times and in different formats (Gebhardt & Berezner, 2017). Therefore, the preparation of data sets for analysis and the construction of the international database require specialized methods and procedures. Preparations for establishing the international database are typically executed by the central organization in close collaboration with the national centers; it is an iterative process with several (international and national) review rounds.

One of the first steps in this process is data cleaning. Major goals involved in data cleaning at this level were described for ICCS 2016 as follows: "All information in the database conformed to the internationally defined data structure; the content of all codebooks and documentation appropriately reflected national adaptations to questionnaires; and all variables used for international comparisons were comparable across countries" (Schulz et al., 2018, p. 87). When the cleaning process is completed, student sampling weights are calculated. PISA 2015, for example, "calculated survey weights for all assessed, ineligible and excluded students, and provided variables in the data that permit users to make approximately unbiased estimates of standard errors, conduct significance tests and create confidence intervals appropriately" (OECD, 2017, p. 8).

Afterward, scaling is conducted for both test and questionnaire responses. Typically, item response theory (IRT) is employed, which makes it possible to "relate observed categorical variables, such as responses to test items, to hypothesised unobservable latent traits, such as proficiency in a subject area" so that "scale scores can be produced that are interval in nature . . . [and] comparable over time and settings" (Berezner & Adams, 2017, p. 324). In order to calculate proficiency estimates (of a whole population, not individuals), ILSAs use plausible value methodology. Typically, five plausible values are generated for each (student) proficiency measured in an assessment and included in the database to facilitate secondary analyses (Gebhardt & Berezner, 2017).

Analytical procedures for PISA 2015, for example, included the following:

• All statistics were computed using sampling weights; standard errors based on balanced repeated replication weights were used for statistical significance and/or confidence intervals;
• Analyses based on achievement test results (plausible values) were based on Rubin's rule for multiply imputed variables;
• The OECD average corresponded to the arithmetic mean of the respective country estimates (Avvisati, 2017).

These are only a few of the many "specialized statistical analysis methods" that these highly complex data require and that need to be taken into account when conducting (primary and secondary) analyses (Rutkowski et al., 2014, p. 5).
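
To give a minimal sense of what the IRT scaling described above involves, the sketch below implements the two-parameter logistic (2PL) item response function and evaluates the log-likelihood of one response pattern at different proficiency values. Operational scaling estimates item parameters and generates plausible values with specialized software, so the item parameters and responses used here are purely illustrative.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) model: probability of a correct response
    given latent proficiency theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(responses, items, theta):
    """Log-likelihood of an observed 0/1 response pattern at a given theta;
    scaling software maximizes (or integrates over) this across students."""
    ll = 0.0
    for x, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

# Illustrative item parameters (a, b) and one response pattern.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]
pattern = [1, 1, 0]
for theta in (-1.0, 0.0, 1.0):
    print(f"theta={theta:+.1f}  logL={log_likelihood(pattern, items, theta):.3f}")
```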

ILSA international databases contain various types of data files: those related to the assessment (i.e., student achievement data files) or the main survey as well as background questionnaire data files (e.g., student, home or parent, teacher, and school principal data files), all of these for all participating entities and in different formats, usually SPSS and SAS. Databases, i.e., compressed versions of data files, are released along with encompassing support materials. For example, the PIRLS 2016 international database includes the following:

• A user guide with three supplements:
1. The international version of all PIRLS 2016 background questionnaires
2. All national adaptations to international background questionnaires
3. Variables derived from the student, home, teacher, and school questionnaire data
• The PIRLS 2016 item information files, IRT item parameters, and item percent correct statistics
• PIRLS 2016 student, home, teacher, and school data files in both SPSS and SAS format
• PIRLS 2016 curriculum questionnaire data files
• Codebook files describing all variables in the PIRLS 2016 international database
• Data almanacs with summary statistics for all PIRLS 2016 items and background variables
• SPSS and SAS programs (Foy, 2018).

ILSAs typically prepare two different versions of their databases: one for general use, containing public-use files (PUF), which are published without variables that have the potential to disclose confidential information about participants and are available for immediate download; and another one, containing restricted-use files (RUF), intended for researchers interested in conducting secondary analyses, including the respective sensitive files, and accessible only with explicit permission from study agencies or their representatives. PUF versions of ILSA databases are available online; for the studies reported here, they can be found on the following websites:

• Databases of all IEA studies are available on the IEA website at https://www.iea.nl/data. In addition, PIRLS, TIMSS, and TIMSS Advanced databases can be accessed on the TIMSS and PIRLS International Study Center (Boston College) website at https://timssandpirls.bc.edu/databases-landing.html.
• The PASEC database is available at http://www.pasec.confemen.org/donnees/.
• PISA databases are available from the official OECD website at http://www.oecd.org/pisa/data/.
• SACMEQ databases can be found at http://www.sacmeq.org/.
• TALIS data are available on the OECD website, at http://www.oecd.org/education/talis/talis-2013-data.htm for TALIS 2013 and at https://www.oecd.org/education/talis/talis-2018-data.htm for TALIS 2018.
• The TERCE database can be found at the UNESCO Santiago website at https://en.unesco.org/fieldoffice/santiago.
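
As a hedged illustration of how such files might be combined for analysis, the sketch below reads hypothetical country-level student and school SPSS files with the pyreadstat package, merges them on assumed school and country identifiers, and pools the countries into one data set. The file names, variable names, and merge keys are assumptions for illustration; the actual naming conventions and linkage rules are documented in each study's user guide and codebooks.

```python
import pandas as pd
import pyreadstat  # assumed installed; reads SPSS (.sav) files into pandas

# Hypothetical file names: real ILSA databases ship one file per country and
# file type, with naming conventions documented in the study user guides.
countries = ["DEU", "FRA"]
merged_parts = []
for c in countries:
    students, _ = pyreadstat.read_sav(f"ASG{c}_example.sav")  # student file
    schools, _ = pyreadstat.read_sav(f"ACG{c}_example.sav")   # school file
    part = students.merge(
        schools,
        on=["IDCNTRY", "IDSCHOOL"],   # assumed identifiers; check the codebook
        how="left",
        suffixes=("", "_SCH"),
    )
    merged_parts.append(part)

pooled = pd.concat(merged_parts, ignore_index=True)
print(pooled.shape)
```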


Results of (primary) data analysis are usually reported at international level, by the ILSA agency together with the directors of the respective study cycle. In addition, separate national reports are produced and published by participating entities themselves, with the support of the study directors and/or agency. Reports of results are accompanied by comprehensive technical documentation that offers critical guidance for data interpretation and the implementation of secondary analyses.

Data Reuse (Secondary Analyses)

In the information science and, more specifically, the digital curation literature, a widely accepted, broad definition refers to data reuse "as the use of data collected for one purpose to study a new problem" (Zimmerman, 2008, p. 634). ILSAs generate massive amounts of data, but their first published reports can present only (major) results for selected aspects. As a consequence, there are plenty of opportunities for reusing the data and conducting further analyses. However, data reuse does not occur as a matter of course. Investigating what contextual information researchers need when deciding whether to reuse existing data, Faniel, Frank, and Yakel (2019) found 12 types organized into 3 categories: data production information, repository information, and data reuse information. As far as information about data production is concerned, researchers wanted details about data collection, specimens and artifacts, the data producer, data analysis, missing data, and research objectives. Repository information that researchers needed included details about data provenance and the creation of the database, about the reputation and history of the repository, as well as descriptions of data curation and digitization procedures and characteristics. Data reuse information that researchers needed included reports about prior reuse, preferably published in peer-reviewed journals and in the form of data critiques; advice on reuse offered in workshops, courses, and documentation; as well as information about the terms of use of the data and repository.

ILSA agencies provide most of these types of information. Explanations about data production are typically presented in detailed methodological and technical documents, while repository information can also be found in user guides, manuals, and supplements published along with the international database and partly also on dedicated pages on study websites. These materials also offer advice for data reuse, emphasizing the analytical steps that are necessary given the complexity of ILSA study designs. For PIRLS 2016, for example, explanations indicate that "all required statistics should be calculated separately for each of the [five] plausible values and the resulting statistics averaged to get the final result," that "standard errors for significance testing of achievement results should be computed using the Jackknife repeated replication method (JK2 variant) to estimate sampling variance and differences between the five plausible values to estimate measurement variance," and a reminder that "PIRLS data are designed for group-level reporting [and that] students' scale scores in the database are not intended to report performance of individual students or very small groups" (Hooper et al., 2017b).
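
Read literally, that advice translates into only a small amount of code. The sketch below, a minimal illustration with invented numbers, computes a weighted mean separately for each plausible value, averages the results, approximates the sampling variance from replicate weights (the construction of the JK2 replicate zones is study-specific and documented in the technical reports), and adds the measurement variance derived from the spread of the plausible-value estimates. Using only the first plausible value for the sampling variance is a common simplification, not the official procedure.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def pv_estimate(pv_sets, total_weights, replicate_weights):
    """Combine plausible values (PVs) with replicate weights.
    pv_sets: list of 5 lists of student scores, one list per plausible value.
    total_weights: final student weights; replicate_weights: list of replicate
    weight vectors (their construction is study-specific, e.g., JK2 in PIRLS)."""
    m = len(pv_sets)
    estimates = [weighted_mean(pv, total_weights) for pv in pv_sets]
    point = sum(estimates) / m                      # average over the 5 PVs

    # Sampling variance: squared deviations of replicate estimates from the
    # full-sample estimate, here computed for the first PV only (a shortcut).
    reps = [weighted_mean(pv_sets[0], rw) for rw in replicate_weights]
    sampling_var = sum((r - estimates[0]) ** 2 for r in reps)

    # Measurement (imputation) variance: spread of the PV estimates.
    imputation_var = sum((e - point) ** 2 for e in estimates) / (m - 1)

    total_var = sampling_var + (1 + 1 / m) * imputation_var  # Rubin's rule
    return point, total_var ** 0.5

# Tiny illustrative data: 4 students, 5 plausible values each, 2 replicates.
pvs = [[500, 520, 480, 510], [505, 515, 470, 520],
       [498, 522, 481, 505], [502, 518, 476, 512], [500, 521, 479, 508]]
w = [1.0, 1.2, 0.9, 1.1]
rweights = [[2.0, 0.0, 0.9, 1.1], [0.0, 2.4, 0.9, 1.1]]
print(pv_estimate(pvs, w, rweights))
```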

For PISA, two versions of the PISA Data Analysis Manual are available, one for SAS and one for SPSS. They offer detailed information not only about the PISA international database itself, together with "worked examples," but also explanations of "the statistical theories and concepts required to analyse the PISA data, including full chapters on how to apply replicate weights and undertake analyses using plausible values" (OECD, 2009).

For analyzing ILSA data, specific macros and software packages have been developed, taking into account the peculiarities of ILSA designs and implementing the required analytical techniques. They are used in primary analyses and are freely available for conducting secondary analyses as well. The IEA IDB Analyzer can be used not only with data from all IEA studies but also with data from most other large-scale assessments, including those conducted by the OECD, for example, PISA and TALIS. The IDB Analyzer "produces SPSS or SAS syntax that can be used to combine and analyze data from across different countries and sources (for example, student, teacher, or school)" and "supports the creation of tables and charts and the calculation of averages, percentages, standard deviations, percentiles, linear and logistic regression, and performance levels" (IEA, n.d.-b). For PISA, the PISA Data Analysis Manual provides SAS or SPSS code, respectively, and user-written software is also available in Stata and R. The OECD developed the PISA Data Explorer (OECD, 2018b) for conducting customized analyses, exporting data, and building reports. TERCE data can be analyzed with the PISA macros, although names and variables have to be adapted. For performing secondary analyses of PASEC 2014 data, macro programs are available in Stata. Another important, more general, web-based analytical tool is the International Data Explorer (IDE), originally developed by the Educational Testing Service (ETS) and available on the website of the National Center for Education Statistics (NCES) in the USA. The IDE offers deployments for some of the school-related ILSAs, i.e., PIRLS, PISA, TALIS, and TIMSS, with data from all participating entities, and allows researchers to run various types of statistical tests for exploring results from both achievement tests and background questionnaires (IEA, 2017).

Data reuse advice for researchers wanting to conduct in-depth analyses of ILSA data is also offered in the form of workshops, from more general sessions at international research conferences to specialized on-demand and onsite courses for whole organizations or smaller teams. Topics may include background knowledge of ILSA research designs, for example, the complex sampling techniques or item development and test design. Statistical methods are other important topics, from basic to advanced, i.e., from an introduction to quantitative data analysis to multilevel modeling with ILSA data or relevant theories, such as IRT. Typically, workshop participants have the opportunity to familiarize themselves with the respective international database and to perform analyses for answering their own research questions, using specialized macros and software. Interpretation of ILSA results, academic (report) writing, and dissemination of results for different audiences are other topics covered in workshops (Gebhardt & Berezner, 2017; IEA, n.d.-b).

Another, rather recent, initiative for facilitating secondary analyses, especially those involving data sets of more than one study, was the development of the ILSA Gateway, available at https://ilsa-gateway.org/ and launched in 2017. Designed as a
central point of entry to all ILSAs, it offers comprehensive information about all studies in a standardized format while maintaining their individual culture and characteristics. Topics covered include fact sheets, allowing a quick overview of each study, frameworks, (research) designs, results, data, (supporting) materials, and the organization(s) behind each study. For each ILSA, textual information is complemented by hyperlinks to resources on the respective study website. In line with the finding that researchers' intentions to reuse data correlate positively with information about prior reuse (Faniel et al., 2019), the Gateway also includes a database of ILSA-related research papers, most of them published in peer-reviewed journals. The platform was developed and is maintained by the IEA on the initiative of NCES, but the content for each study page was generated in close collaboration with the respective study directors and agency (IEA, 2017). The Gateway is a useful tool not only for new researchers in the field but for researchers in general, especially in the early phase of a new research project, when gaining an overview and understanding of the various educational ILSAs, preparing research questions, and taking decisions about the selection of data sets.

Given that the accessibility of data has a strong relationship with researchers' data reuse intentions (Kim & Yoon, 2017) and satisfaction (Faniel et al., 2016), it may be considered a shortcoming that the Gateway provides links to data on the individual study websites only, rather than a shared repository. Even if setting up a common ILSA repository with data sets from all studies and all cycles in various formats requires in-depth analyses of user needs, numerous discussions, and thorough preparations, it might be worth the endeavor.

Conclusion

ILSAs have achieved a lot; however, there is room for refinement and expansion. This concluding section identifies some gaps in terms of assessment domains, studied populations, and research methodologies. Further, it addresses some challenges related to data reuse as well as to the interpretation of results and their usage to shape practice, regardless of whether outcomes are derived from primary or secondary data analyses.

Since their start 60 years ago, school-related ILSAs have studied predominantly the domains of mathematics, reading, and science. Other domains of investigation, also with a longer research history, include civics and citizenship as well as computer and information literacy. More recent domains, which appeared partly as a consequence of recent societal and/or technological developments, comprise financial literacy, (collaborative) problem-solving, computational thinking, and global competence. Health-related topics of regional interest that were addressed encompass HIV/AIDS knowledge and TB knowledge. There are several possibilities for expansion. For example, foreign languages could (re-)appear as domains of investigation, as they have gained in importance with recent migration trends, or the arts, highlighting other types of knowledge and skills.

Also, and probably even more challenging but no less relevant, topics related to technical and vocational schooling are waiting to be put on ILSA agendas.

As far as investigated populations are concerned, a clear focus has been on students. Given the well-established importance of teachers and principals for student learning (Hattie, 2009), it is time to address (more) aspects related to them in large-scale studies. This could happen as part of existing ILSAs, by adding some explicit research questions and defining teachers and/or principals as secondary units of analysis. A shortcoming of this approach is that it would remain an investigation in a specific domain or subject only. Also needed are more general studies about teachers and principals in separate, dedicated ILSAs with them as primary units of analysis.

ILSAs predominantly use quantitative research designs, with some qualitative elements, for example, the descriptive encyclopedia chapters in PIRLS and TIMSS. Considering both as complementary rather than competing methodologies, and aiming at gaining a deeper understanding of the world of study participants as they experience it, additional, alternative approaches are needed. For example, some qualitative research questions could be meaningfully added to the existing quantitative research questions, shaping the whole research process accordingly, so that mixed methods studies are conducted. Another, no less demanding, approach would be to set up separate qualitative large-scale investigations. Of course, making use of qualitative data collection techniques and instruments would not be enough. Instead, a fundamental shift in thinking is needed, from (post)positivist, objectivist ontological and epistemological beliefs to interpretivist and subjectivist views, with consequences for all phases of the research process.

ILSA agencies have taken important measures to support data reuse, providing comprehensive explanations and advice, specific software, and adequate training. However, data reuse does not happen as a matter of course. The Gateway, the first central ILSA platform, helps researchers interested in conducting secondary analyses by presenting important information about all studies and facilitating access to data and other resources. Additional efforts may be needed for the Gateway to realize its full potential for the ILSA research community, both in terms of enhancing knowledge exchange and in terms of inspiring future research.

Whether a project reuses ILSA data or sets up a new ILSA cycle or even a completely new study, the latter two requiring the collection of new data, a major challenge remains the development of the theoretical framework. Given the quantitative, deductive designs of these studies, the framework needs, from a scientific perspective, to be based on previous research identified in a thorough and comprehensive literature review. This has been realized in ILSAs and in projects of secondary analyses to differing degrees. There is a need to move beyond referring only to the frameworks and results of previous ILSA cycles. All constructs and aspects covered in a quantitative study, regardless of its size, should be based on research, including recent investigations and important findings from other disciplines, if applicable.

Other major challenges are related to the interpretation and usage of results.
achievement, for a whole education system not individual schools, classes, or students. When reporting these results, it may be tempting to focus on rankings of participating countries. However, this is not the major objective of ILSAs. Rather, they provide solid data in order to allow participating entities to learn about themselves (especially, if they participate repeatedly) and to learn from one another. ILSAs also identify relations between assessment or survey outcomes and contextual factors, in terms of correlations not cause-and-effect relationships. All results, those about achievements as well as those about possible correlations of achievements with contextual factors, need to be interpreted carefully. It is crucial to consider them in the cultural, political, economic, and social contexts of each country and, finally, also in the context of each study, as they differ in terms of their background, culture, and design. ILSAs provide enormous amounts of data and important information to all those interested in offering and supporting high-quality education around the world. However, it lies within the responsibility of all parties involved, i.e., (other) researchers, policy makers, and practitioners, to use them at their best.

References Ainley, J., & Carstens, R. (2018). Teaching and learning international survey (TALIS) 2018: Conceptual framework. OECD education working papers no. 187. Paris. Retrieved from OECD website: https://doi.org/10.1787/799337c2-en Avvisati, F. (2017). PISA. In IEA (Ed.), ILSA Gateway. Retrieved from https://www.ilsa-gateway. org/studies/factsheets/70 Berezner, A., & Adams, R. J. (2017). Why large-scale assessments use scaling and item response theory. In P. Lietz, J. C. Cresswell, K. F. Rust, & R. J. Adams (Eds.), Implementation of largescale education assessments (pp. 323–356). Wiley. Chabaditsile, G. K., Galeboe, A. K., & Dominic Nkwane, T. (2018). The SACMEQ IV project in Botswana: A study of the conditions of schooling and the quality of education. Retrieved from SACMEQ website: http://www.sacmeq.org/sites/default/files/final_saqmeq_iv_report_ botswana-compressed.pdf Chromy, R. R. (2002). Sampling issues in design, conduct, and interpretation of international comparative studies of school achievement. In National Research Council (Ed.), Methodological advances in cross-national surveys of educational achievement (pp. 80–114). National Academy. Department of Basic Education, Rep. of South Africa. (2017). The SACMEQ IV project in South Africa: A study of the conditions of schooling and the quality of education. Retrieved from http://www.sacmeq.org/sites/default/files/sacmeq/reports/sacmeq-iv/national-reports/ sacmeq_iv_project_in_south_africa_report.pdf Faniel, I. M., Kriesberg, A., & Yakel, E. (2016). Social scientists’ satisfaction with data reuse. Journal of the Association for Information Science and Technology, 67(6), 1404–1416. https:// doi.org/10.1002/asi.23480 Faniel, I. M., Frank, R. D., & Yakel, E. (2019). Context from the data reuser’s point of view. Journal of Documentation, 75(6), 1274–1297. https://doi.org/10.1108/JD-08-2018-0133 Foy, P. (2018). PIRLS 2016: User guide for the international database. Retrieved from IEA website: https://timssandpirls.bc.edu/pirls2016/international-database/downloads/P16_UserGuide.pdf Fraillon, J., Jung, M., Borchert, L., & Tieck, S. (2017). ICILS. In IEA (Ed.), ILSA gateway. Retrieved from https://ilsa-gateway.org/studies/factsheets/60 Fraillon, J., Ainley, J., Schulz, W., Friedman, T., & Duckworth, D. (2019). Preparing for life in a digital age: The IEA International Computer and Information Literacy Study 2018.

International report, IEA. Retrieved from https://www.iea.nl/sites/default/files/2019-11/ICILS% 202019%20Digital%20final%2004112019.pdf Gebhardt, E., & Berezner, A. (2017). Database production for large-scale educational assessments. In P. Lietz, J. C. Cresswell, K. F. Rust, & R. J. Adams (Eds.), Implementation of large-scale education assessments (pp. 411–423). Wiley. Glossary. (2017). In IEA (Ed.), ILSA gateway. Retrieved from https://www.ilsa-gateway.org/glossary Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge. Heyneman, S. P., & Lee, B. (2014). The impact of international studies of academic achievement on policy and research. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (Chapman & Hall/CRC statistics in the social and behavioral sciences) (pp. 37–72). CRC Press. Hooper, M., Fishbein, B., Taneva, M., Meyer, S., Savaşcı, D., & Eraydin, I. (2017a). TIMSS advanced. In IEA (Ed.), ILSA gateway. Retrieved from https://ilsa-gateway.org/studies/ factsheets/66 Hooper, M., Fishbein, B., Taneva, M., Meyer, S., Savaşcı, D., & Nozimova, Z. (2017b). PIRLS. In IEA (Ed.), ILSA gateway. Retrieved from https://ilsa-gateway.org/studies/factsheets/63 Hooper, M., Fishbein, B., Taneva, M., Meyer, S., Savaşcı, D., & Omatsone, A. (2017c). TIMSS. In IEA (Ed.), ILSA gateway. Retrieved from https://ilsa-gateway.org/studies/factsheets/65 Hounkpodote, H., Ounteni, M. H., & Marivin, A. (2017). PASEC. In IEA (Ed.), ILSA gateway. Retrieved from https://www.ilsa-gateway.org/studies/factsheets/76 IEA (Ed.). (2017). ILSA gateway. Retrieved from https://ilsa-gateway.org/ IEA. (n.d.-a). IEA website: Early IEA studies. Retrieved from https://www.iea.nl/studies/iea/earlier IEA. (n.d.-b). IEA website: Tools. Retrieved from https://www.iea.nl/data-tools/tools Kim, Y., & Yoon, A. (2017). Scientists’ data reuse behaviors: A multilevel analysis. Journal of the Association for Information Science and Technology, 68(12), 2709–2719. https://doi.org/10. 1002/asi.23892 Lietz, P. (2017). Design, development and implementation of contextual questionnaires in largescale assessments. In P. Lietz, J. C. Cresswell, K. F. Rust, & R. J. Adams (Eds.), Implementation of large-scale education assessments (pp. 92–136). Wiley. Lietz, P., Cresswell, J. C., Rust, K. F. [Keith F.], & Adams, R. J. (2017). Implementation of largescale education assessments. In P. Lietz, J. C. Cresswell, K. F. Rust, & R. J. Adams (Eds.), Implementation of large-scale education assessments (pp. 1–25). Wiley. LLECE. (2015). TERCE: Executive summary. Initial background information. Retrieved from https://unesdoc.unesco.org/ark:/48223/pf0000243980_eng Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2016). Methods and procedures in TIMSS advanced 2015. IEA. Retrieved from https://timssandpirls.bc.edu/publications/timss/2015-amethods.html Martin, M. O., Mullis, I. V. S., & Hooper, M. (Eds.). (2017). Methods and procedures in PIRLS 2016. Retrieved from https://timssandpirls.bc.edu/publications/pirls/2016-methods.html Mendelovits, J. (2017). Test development. In P. Lietz, J. C. Cresswell, K. F. Rust, & R. J. Adams (Eds.), Implementation of large-scale education assessments (pp. 63–91). Wiley. Mullis, I. V. S., & Martin, M. O. (Eds.). (2014). TIMSS advanced 2015: Assessment frameworks. IEA. Retrieved from https://timss.bc.edu/timss2015-advanced/frameworks.html Mullis, I. V. 
S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). PIRLS 2011 international results in reading. IEA. Retrieved from https://pirls.bc.edu/pirls2011/downloads/P11_IR_FullBook.pdf Mullis, I. V. S., Martin, M. O., Foy, P., & Hooper, M. (2017). PIRLS 2016 international results in reading. Retrieved from IEA website: http://timssandpirls.bc.edu/pirls2016/internationalresults/wp-content/uploads/structure/CompletePDF/P16-PIRLS-International-Results-inReading.pdf

Mullis, I. V. S., Martin, M. O., Foy, P., Kelly, D. L., & Fishbein, B. (2020). TIMSS 2019 international results in mathematics and science. Retrieved from IEA website: https://timss2019.org/reports/download-center/ OECD. (2009). PISA data analysis manual: SPSS and SAS. Retrieved from http://www.oecd.org/pisa/pisaproducts/pisadataanalysismanualspssandsassecondedition.htm OECD. (2017). PISA 2015: Technical report (PISA). Retrieved from https://www.oecd.org/pisa/data/2015-technical-report/PISA2015_TechRep_Final.pdf OECD. (2018a). PISA for development: Results in focus. Retrieved from https://www.oecd-ilibrary.org/docserver/c094b186-en.pdf?expires=1582903525&id=id&accname=guest&checksum=9BB58D0B32890E6DECD88305B6A9CD9A OECD. (2018b). PISA website: Data. Retrieved from https://www.oecd.org/pisa/data/ OECD. (2018c). PISA website: What is PISA? Retrieved from https://www.oecd.org/pisa/ OECD. (2019a). PISA 2018 assessment and analytical framework. PISA. OECD. Retrieved from https://www.oecd-ilibrary.org/docserver/b25efab8-en.pdf?expires=1582725942&id=id&accname=oid011384&checksum=8A3DA8EED3551BEA8A9E3B1CA517EDBB OECD. (2019b). TALIS 2018 technical report. Retrieved from https://www.oecd.org/education/talis/TALIS_2018_Technical_Report.pdf PASEC. (2015). PASEC 2014: Education system performance in francophone Sub-Saharan Africa: Competencies and learning factors in primary education. Retrieved from http://www.pasec.confemen.org/wp-content/uploads/2015/12/Rapport_Pasec2014_GB_webv2.pdf Rust, K. [Keith]. (2014). Sampling, weighting, and variance estimation in international large-scale assessments. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (Chapman & Hall/CRC statistics in the social and behavioral sciences) (pp. 117–153). CRC Press. Rutkowski, D., Rutkowski, L., & von Davier, M. (2014). A brief introduction to modern international large-scale assessment. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (Chapman & Hall/CRC statistics in the social and behavioral sciences) (pp. 3–9). CRC Press. Schulz, W., Ainley, J., Fraillon, J., Losito, B., & Agrusti, G. (2016). IEA International Civic and Citizenship Education Study 2016: Assessment framework. IEA. Retrieved from https://link.springer.com/book/10.1007/978-3-319-39357-5 Schulz, W., Carstens, R., Brese, F., Atasever, U., & Nozimova, Z. (2017). ICCS. In IEA (Ed.), ILSA gateway. Retrieved from https://ilsa-gateway.org/studies/factsheets/59 Schulz, W., Carstens, R., Losito, B., & Fraillon, J. (2018). International Civic and Citizenship Education Study 2016: Technical report. Retrieved from https://www.iea.nl/sites/default/files/2019-07/ICCS%202016_Technical%20Report_FINAL.pdf Tremblay, K., Fraser, P., Knoll, S., Carstens, R., & Dumais, J. (2017). TALIS. In IEA (Ed.), ILSA gateway. Retrieved from https://www.ilsa-gateway.org/studies/factsheets/71 Viteri, A., & Inostroza Fernández, P. (2017). TERCE. In IEA (Ed.), ILSA gateway. Retrieved from https://www.ilsa-gateway.org/studies/factsheets/67 Wagemaker, H. (2014). International large-scale assessments: From research to policy. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (Chapman & Hall/CRC statistics in the social and behavioral sciences) (pp. 11–36). CRC Press. Zimmerman, A. S. (2008). New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Science, Technology, & Human Values, 33(5), 631–652.

IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement

14

Ina V. S. Mullis and Michael O. Martin

Contents
IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement
Gradually Updating the TIMSS and PIRLS Frameworks
Role of the TIMSS and PIRLS Encyclopedias
Roles of Committees of International Experts
Role of the National Research Coordinators
Large Numbers of Items Continually Refreshed
The PIRLS 2021 Rotated Design and Number of Items
The TIMSS 2023 Rotated Design and Number of Items
Introducing Ambitious Innovations on a Small Scale
Less Difficult TIMSS and PIRLS Assessments
Transitioning to e-Assessment
Methodology for Linking Successive Assessments to a Common Scale
Item Calibration, Conditioning, and Plausible Values
Concurrent Calibration
Bridges When There Are Major Changes
TIMSS 2007 Bridge
TIMSS 2019 Bridge
Conclusion/Summary
References

Abstract

IEA's TIMSS and PIRLS have been monitoring trends in educational achievement for more than two decades (TIMSS since 1995 and PIRLS since 2001). Providing valid and reliable trend results is not just a matter of administering the same instruments again and again. As the world changes, so too must the assessments, to reflect new content, shifts in instruction and learning, and numerous technological advancements.

To keep the assessments and results current and relevant, TIMSS and PIRLS use an assessment design that is based on (1) regular updates of the assessment frameworks with each assessment cycle, (2) large pools of assessment items that are systematically refreshed, (3) introducing ambitious new initiatives on a small scale, (4) a methodology for linking successive assessments to a common scale, and (5) incorporating a bridge in the rare instances of major changes. In brief, newly developed items are rotated into each assessment cycle according to rigorous designs that provide for the majority (about two-thirds) of each assessment to be based on trend items. In general, for any given assessment, about one-third of the items are newly developed, one-third assess two cycles of trend, and one-third assess three cycles. Concurrent calibration of IRT item parameters enables linking each new assessment cycle to the original achievement scales created as part of the first assessments, using successive concurrent calibrations that link each pair of adjacent assessment cycles. For bridging major design changes, representative samples of students take the assessments not only under the new conditions but also under the previous conditions.

I. V. S. Mullis (*) · M. O. Martin
TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA, USA
e-mail: [email protected]

Keywords

TIMSS · PIRLS · Trends · Concurrent calibration · ILSA

IEA's TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement

IEA's TIMSS (Trends in International Mathematics and Science Study) and PIRLS (Progress in International Reading Literacy Study) have monitored progress in educational achievement internationally for more than two decades. TIMSS has provided comparative measures of mathematics and science achievement at the fourth and eighth grades every 4 years since 1995, with the current 2019 cycle measuring trends in achievement across a 24-year period. PIRLS has monitored changes in reading achievement every 5 years, with the current 2021 cycle providing trends across a 20-year period.

Providing valid and reliable trend results in student achievement is considerably more complicated than simply administering the same instruments over and over. The world is a dynamic, changing place, and trend assessments also need to be dynamic and evolving enterprises to reflect the changing environment and remain relevant to policy makers and educators. As the world changes, so must the content and methods used for the trend assessments. For example, in the decades since the inception of TIMSS and PIRLS, developments in technology have altered the mathematics and science constructs to be measured, first by including calculators and then computers as major areas of teaching and learning. Similarly, for reading, because searching the internet is now the prevalent approach for students to obtain information for school projects (as well as their out-of-school activities), good readers need to be able to navigate efficiently through websites.

While creating new learning areas that need to be assessed, technology also has provided new opportunities to conduct innovative assessments in all curricular areas (Khorramdel et al., in press). Assessment tasks can now be interactive, providing students with audio and video stimuli, manipulatives and tools, and efficient response modes. Such assessments can be engaging and motivating, which leads to higher response rates. In addition, there is the opportunity to collect valuable information about students' problem-solving strategies as well as their test-taking strategies. Because it improves measurement and relevance, assessments around the world are incorporating technology as part of both their content and methods.

But what about measuring trends? TIMSS and PIRLS provide robust trend measures, while at the same time including innovative assessment content and methods. TIMSS and PIRLS use an assessment design that was created specifically to provide comparability of achievement measures across assessment cycles while simultaneously evolving to maintain relevance with curriculum and instruction in the participating countries. Essentially, there are a large number of items, with a larger proportion of each assessment cycle based on trend materials than on newly developed materials, and the newly developed materials need to address emerging areas of curriculum and instruction. The TIMSS and PIRLS assessment design for measuring trends is based on:

• Regular, gradual updates of broadly based assessment frameworks
• Large pools of assessment items that are systematically refreshed
• Introducing ambitious new initiatives on a small scale
• A methodology for linking successive assessments to a common scale
• Incorporating a bridge in the event of major changes

The remainder of the chapter discusses each of these in turn.

Gradually Updating the TIMSS and PIRLS Frameworks

The TIMSS and PIRLS frameworks have been kept relatively stable. Figure 1 shows the broad areas that form the TIMSS 2019 Mathematics Framework and the TIMSS 2019 Science Framework (Centurino & Jones, 2017; Lindquist et al., 2017).

Fig. 1 TIMSS 2019 content and cognitive domains

The TIMSS frameworks are based on two dimensions: a content dimension and a cognitive dimension, with each broad area weighted by the target percentage of testing time to be devoted to it. These broad areas have not changed since TIMSS 2003, when the cognitive dimension was added. Because TIMSS reports trends in each of the content and cognitive areas as well as in mathematics and science overall, the areas need to be broad enough that each is important enough to warrant at least 20% of the assessment time. Each broad content area consists of a number of smaller topics, with the number of topics in an area corresponding to the percentage emphasis given to the area (e.g., an area weighted 40% would have twice as many topics as an area weighted 20%). To anchor the assessments and provide stability across time, the weightings of the broad dimensions guiding the assessments cannot be increased or decreased by more than 5% from one
assessment cycle to the next. However, the topics within the content areas can be modified, and typically are revised slightly with each cycle to reflect shifts in mathematics and science curricula and instruction. Figure 2 shows that the PIRLS 2021 framework for assessing reading comprehension also has two broad dimensions – reading purposes and comprehension processes (Mullis & Martin, 2019). There are two reading purposes – for literary experience and to acquire/use information – each of which has comprised 50% of each successive PIRLS assessment cycle since PIRLS 2001. The four comprehension processes were introduced in PIRLS 2006. The retrieving and straightforward inferencing processes are combined for reporting purposes as are the integrating and evaluating processes, such that each process measure is based on 50% of the items. Over the past two decades, with a growing number of participating countries, PIRLS has needed to address an increasingly large range of student achievement, requiring a greater range in the difficulty level of the passages. The number of passages grew from 8 passages in PIRLS 2001 (4 literary and 4 informational) to 18 passages in PIRLS 2021 (9 informational and 9 literary).

Fig. 2 PIRLS 2021 reading purposes and comprehension processes

Role of the TIMSS and PIRLS Encyclopedias

The TIMSS and PIRLS Encyclopedias are central to the process of updating the frameworks because they document changes in countries' curricula since the previous assessment cycle. Producing the TIMSS and PIRLS Encyclopedias about
countries' mathematics and science curricula and instruction is a routine part of participating in each assessment cycle. See, for example, the TIMSS 2015 Encyclopedia (Mullis et al., 2016) and the PIRLS 2016 Encyclopedia (Mullis et al., 2017). The TIMSS and PIRLS Encyclopedias describe each country's national education policies at the time of that assessment and provide valuable information about the national educational contexts that support understanding the international achievement results. To create the Encyclopedias, each country completes a chapter describing its mathematics and science curricula (TIMSS) or reading curriculum (PIRLS). So that the Encyclopedias also provide comparable country-level data, each country completes the TIMSS or PIRLS Curriculum Questionnaire, respectively. The questionnaire asks about the country's curricular policies, whether the curriculum described in the country's chapter is under revision, and, if so, for a description of the upcoming changes. Prior to updating the assessment frameworks for the new cycle, the TIMSS & PIRLS International Study Center summarizes the descriptions of the curricula across countries, as well as the information about curriculum revisions and topics taught, to identify any content areas or topics that appear to be emerging or deemphasized across countries.

Roles of Committees of International Experts

Before the beginning of an upcoming cycle, the TIMSS & PIRLS International Study Center appoints committees of international experts to help guide new assessment development for that cycle: the TIMSS Science and Mathematics Item Review Committee (SMIRC) and the PIRLS Reading Development Group (RDG). The SMIRC and RDG members review the previous TIMSS or PIRLS assessment framework, respectively, and make recommendations for improvement. Then, the TIMSS & PIRLS International Study Center
works together with representatives from the expert groups to incorporate updates suggested by the review of the Encyclopedia’s curricular information across countries and the recommendations of the international experts into a draft framework for the upcoming cycle. Recommendations for updates may involve modifying content, ensuring that the number of topic areas reflects the content domain weighting, and reviewing topic descriptions to improve clarity for item writers.

Role of the National Research Coordinators

The updated draft framework is presented for review and discussion at the first meeting of National Research Coordinators (NRCs) for the new assessment cycle. Each country participating in TIMSS or PIRLS identifies an NRC who is responsible for implementing the assessment in that country. The NRCs are asked for recommendations about how to improve the data collected, and for TIMSS each content topic is reviewed, asking specifically whether the topic should be retained as is, modified, or deleted. Based on the NRC discussions, the TIMSS and PIRLS frameworks are updated and distributed digitally to the countries for in-country reviews. That is, each country has the opportunity to convene its mathematics and science experts (TIMSS) or reading experts (PIRLS) to comment on the updated frameworks and suggest revisions. Following the in-country reviews, the TIMSS & PIRLS International Study Center and the expert committees work together to incorporate countries' recommendations, and the countries are given one final opportunity to review the updated frameworks prior to publication.

Large Numbers of Items Continually Refreshed

TIMSS and PIRLS use rotated designs to ensure that the items are distributed to students in such a way that the same number of students answers each item, that the students answering each item are a representative sample of the target population, that items appear in different parts of the assessment (an equal number of times at the beginning and at the end), and that the items are paired according to a rigorous algorithm that links all of the items together. The rotated designs also ensure that the majority of the items are retained from cycle to cycle to measure trends and that the remaining items are retired to make room for newly developed items addressing new and emerging areas of curriculum and instruction.

To provide a robust trend measure, the majority (about two-thirds) of the items in each successive TIMSS or PIRLS assessment are trend items from previous assessments. To refresh and update the item pool for each new assessment cycle, about one-third of the items per subject per grade are retired and replaced with newly developed materials. No items are kept for longer than three cycles. Researchers can seek permission from IEA to use the retired items for research purposes.
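This rotation principle can be illustrated with a small simulation. The sketch below is only a schematic illustration under assumed conventions – the block labels, the 18-block pool size, and the simple "retire the oldest third" rule are placeholders and do not reproduce IEA's actual block-assignment procedures.

```python
# Minimal sketch of the rotation principle described above: each cycle, about
# one-third of the blocks are retired and replaced with newly developed blocks,
# so no block remains in the pool for more than three cycles. Pool size, labels,
# and the "retire the oldest third" rule are illustrative assumptions only.

def rotate_blocks(pool, cycle_label):
    """Retire the oldest third of the pool and append newly developed blocks."""
    n_new = len(pool) // 3
    survivors = pool[n_new:]                      # drop the oldest third
    new_blocks = [f"{cycle_label}-N{i + 1}" for i in range(n_new)]
    return survivors + new_blocks                 # oldest blocks stay at the front

# Hypothetical pool of 18 blocks introduced over three earlier cycles (C1-C3).
pool = [f"C{c}-N{i + 1}" for c in (1, 2, 3) for i in range(6)]

for cycle in ("C4", "C5"):
    pool = rotate_blocks(pool, cycle)
    trend = sum(1 for block in pool if not block.startswith(cycle))
    print(f"{cycle}: {trend}/{len(pool)} trend blocks, {len(pool) - trend} newly developed")
# C4: 12/18 trend blocks, 6 newly developed
# C5: 12/18 trend blocks, 6 newly developed
```

Under these assumptions, each simulated cycle is composed of two-thirds trend blocks and one-third newly developed blocks, matching the proportions described above.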

The PIRLS 2021 Rotated Design and Number of Items

The PIRLS 2021 rotated design at the fourth grade is based on 18 reading passages, with each passage accompanied by about 15 items, for a total of 270 items (Martin et al., 2019). Each passage together with its associated items is administered as a 40-min block, and students take two blocks with a half-time break. Since its inception in PIRLS 2001, the PIRLS Reading Framework has specified 50% literary passages and 50% informational passages, and since PIRLS 2011, each student has received one literary and one informational passage. Across the booklets, the passages occur in alternating order to account for position effect, and the passages are linked according to a rigorous plan such that each passage appears in two booklets, each time paired with a different passage.

PIRLS 2021 is transitioning to a digital environment and initiating a group adaptive design (Martin et al., 2019). The group adaptive design is based on having passages at three levels of difficulty (more difficult, medium difficulty, less difficult) that are assembled into booklets of two difficulty levels: the more difficult booklets combine medium difficulty and more difficult passages, and the less difficult booklets combine medium difficulty and less difficult passages. All countries administer the same PIRLS booklets, but at different rates depending on the overall reading achievement level in a country. For example, countries with higher average reading achievement administer the more difficult booklets at proportionally higher rates, while lower performing countries administer the less difficult booklets at higher rates.

Figure 3 shows the PIRLS 2021 assessment blocks by reading purpose (50% literary and 50% informational, in accordance with the framework) and by reading difficulty (D = difficult, M = medium, L = less difficult).

Fig. 3 Group level adaptive testing in PIRLS 2021

Six of the passage/item sets are retained from two cycles ago, six are from the last cycle, and six were newly developed for 2021. The shading indicates the six blocks that will be retired after
PIRLS 2021: for each purpose, one passage from each of three assessment cycles (2011, 2016, and 2021) as well as one passage from each difficulty level. The PIRLS 2021 design thus has symmetry, with 2/3 trend and 1/3 newly developed materials, and can be repeated for PIRLS 2026. The 18 blocks of passages and items are arranged in 18 student booklets, with two passages and their items (one literary and one informational) in each booklet. Booklets 1 through 9 are the more difficult booklets and Booklets 10 through 18 the less difficult booklets. As examples of the more difficult booklets, Booklet 1 contains passages InfM1 and LitD1, Booklet 2 contains LitD3 and InfD2, and Booklet 3 contains LitM1 and InfD1. Among the less difficult booklets, Booklet 10 contains LitL1 and InfM1, Booklet 11 contains InfL2 and LitM2, and Booklet 12 contains InfL1 and LitM3.
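The group-adaptive allocation just described can be sketched as follows. The booklet numbering (Booklets 1–9 more difficult, Booklets 10–18 less difficult) follows the text above, but the 70/30 and 30/70 administration rates are hypothetical placeholders rather than the proportions actually used in PIRLS 2021.

```python
import random

# Illustrative sketch of group-adaptive booklet assignment: all countries use the
# same 18 booklets, but countries with higher average reading achievement receive
# the more difficult booklets (1-9) at a higher rate, and lower-achieving countries
# receive the less difficult booklets (10-18) at a higher rate. The rates below are
# hypothetical, not the operational PIRLS 2021 values.

MORE_DIFFICULT = list(range(1, 10))      # Booklets 1-9
LESS_DIFFICULT = list(range(10, 19))     # Booklets 10-18

ASSUMED_RATE_OF_DIFFICULT_BOOKLETS = {
    "higher_achieving": 0.7,
    "lower_achieving": 0.3,
}

def assign_booklet(country_group, rng):
    """Draw one booklet number for a sampled student in the given country group."""
    p_more = ASSUMED_RATE_OF_DIFFICULT_BOOKLETS[country_group]
    pool = MORE_DIFFICULT if rng.random() < p_more else LESS_DIFFICULT
    return rng.choice(pool)

rng = random.Random(1)
sample = [assign_booklet("higher_achieving", rng) for _ in range(10_000)]
share_difficult = sum(b <= 9 for b in sample) / len(sample)
print(f"Share of more difficult booklets in a higher-achieving country: {share_difficult:.2f}")
```

The key design point the sketch tries to convey is that no country receives different booklets, only different proportions of the same booklets, which is what allows all results to remain on one common scale.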

The TIMSS 2023 Rotated Design and Number of Items

Since 2007, the TIMSS rotated design has been based on 14 booklets, each of which has two blocks of science and two blocks of mathematics, with science and mathematics alternated to account for position effect and paired according to a rigorous algorithm (Martin et al., 2017). Each two-block half of the assessment is given 36 min at the fourth grade and 45 min at the eighth grade, totaling 72 min at the fourth grade and 90 min at the eighth grade, respectively (not including a half-time break). In general, it requires about 800 items to populate this design: mathematics and science each had about 175 items at the fourth grade and about 220 items at the eighth grade. In TIMSS 2023, TIMSS will begin the transition to a group adaptive design that mirrors the PIRLS 2021 design. TIMSS 2023 also will have 18 blocks, with 2/3 of the blocks being trend blocks and 1/3 newly developed materials. The four additional item blocks in TIMSS 2023 at each grade and subject will require approximately an additional 230 items, totaling more than 1,000 items.
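The approximate item counts quoted above can be checked with simple arithmetic; the figures below are the rounded values from the text, so the totals are approximations rather than official counts.

```python
# Quick arithmetic check of the approximate item counts quoted above.
items_per_subject_grade4 = 175     # mathematics and science each, fourth grade
items_per_subject_grade8 = 220     # mathematics and science each, eighth grade

fourteen_booklet_total = 2 * items_per_subject_grade4 + 2 * items_per_subject_grade8
print(f"14-booklet design: about {fourteen_booklet_total} items")          # 790, i.e., roughly 800

additional_items_2023 = 230        # four extra blocks per subject and grade
print(f"TIMSS 2023 pool: about {fourteen_booklet_total + additional_items_2023} items")  # 1020, i.e., more than 1,000
```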

Introducing Ambitious Innovations on a Small Scale

Rotating 1/3 newly developed items into TIMSS and PIRLS with each assessment cycle provides a steady stream of new content and item types to be incorporated into the assessments. However, with more countries joining TIMSS and PIRLS, including some with students who, for one reason or another, struggle with the assessments and others with students very familiar with innovative technology, more significant steps often have been necessary to keep the assessments aligned with the populations being assessed as well as with new and emerging curricula and instruction.

Less Difficult TIMSS and PIRLS Assessments

With TIMSS and PIRLS in 2011, it became apparent that some countries were finding the assessments too difficult, to the extent that their achievement results
were not being reliably estimated (Martin et al., 2013). TIMSS and PIRLS needed to extend the lower end of the achievement scales to provide better measurement for lower achievers. Finding a way to address this issue began in PIRLS 2011, when three countries took a less difficult version of PIRLS, based on the PIRLS 2011 Assessment Framework but with less difficult reading passages, and four other countries opted to give regular PIRLS at the sixth grade. Neither was a very satisfactory solution, because these approaches did not provide results comparable with those of the bulk of the PIRLS countries. In the next cycle, PIRLS 2016 had two assessments – PIRLS and PIRLS Literacy – nearly parallel in scope, with the regular PIRLS and the less difficult assessment having four blocks in common. This enabled the results to be reported on the same scale, but required double the amount of development resources for only a small number of countries. With the additional resources required for transitioning to digitalPIRLS, the double-assessment approach was not feasible, and PIRLS 2021 introduced group adaptive testing (explained previously). The PIRLS 2021 design unifies PIRLS and PIRLS Literacy (the less difficult version of PIRLS used in previous cycles) into a single integrated assessment and, through the group adaptive approach, expands the range of assessment difficulty and allows PIRLS to be better aligned with the achievement of the assessed student populations in each country.

Similarly, in response to countries that needed a less difficult mathematics assessment at the fourth grade, TIMSS 2015 introduced the option of TIMSS Numeracy. This was a special assessment of less difficult mathematics that did not have a science component and was given separately from TIMSS. The less difficult mathematics items were appropriate for a number of countries, but these countries also wanted results for science. So, for TIMSS 2019, some of the "numeracy" blocks were brought forward into regular TIMSS mathematics, replacing some of the blocks, such that there were four blocks linking "less difficult" mathematics and regular mathematics. Countries participating in the less difficult mathematics were reported on the same scale as those participating in regular mathematics, and also had the same science assessment data. As noted previously, TIMSS 2023 will begin the transition to the group adaptive approach, in which one integrated assessment includes items with a wide range of difficulty.

Transitioning to e-Assessment

PIRLS and TIMSS have made the transition to e-assessment in gradual stages. It takes three assessment cycles (2016, 2021, and 2026) for PIRLS to become fully digital. PIRLS 2016 initiated the ePIRLS assessment of internet reading on a voluntary basis, to be given on a second day after the regular paper-and-pencil PIRLS 2016 assessment. ePIRLS consisted of five school-like tasks or projects about science and social studies topics. Guided through a simulated internet environment by a teacher avatar, students read websites that had interactive features such as tabs, pop-ups, animations, and videos. As they read and navigated through the websites, students
also answered questions about what they had read. Fourteen countries managed to arrange for a venue and the computers to participate in ePIRLS, and this first computer-based venture provided valuable results. Almost all the students completed the assessment, the results provided the countries with interesting information about the differences between paper and internet reading, and the students liked the internet reading better. Two of the tasks, together with their questions and scoring guides, were released to the public via the PIRLS website.

For PIRLS 2021, approximately 30 countries (half) transitioned to the digital environment, and ePIRLS is incorporated into the digitally based assessment. Two new ePIRLS tasks were developed to replace the two 2016 tasks that were released on the web, and the five tasks are rotated among the 18 PIRLS booklets, with some hybrid booklets (one digitalPIRLS passage and one ePIRLS task). The TIMSS & PIRLS International Study Center expects to learn much more about the assessment properties of ePIRLS after analyzing the 2021 results. To complete the transition, it is anticipated that in PIRLS 2026 nearly all countries will be administering PIRLS in a digital environment and ePIRLS can be fully incorporated into PIRLS for all of the countries. The degree to which PIRLS 2026 will be devoted to assessing internet reading will be a matter of discussion and agreement among the TIMSS & PIRLS International Study Center, the PIRLS 2026 RDG, and the PIRLS 2026 NRCs.

TIMSS transitioned to computer-based assessment over the 2019 and 2023 assessment cycles. For the more than 30 countries (about half) that transitioned to e-assessment in TIMSS 2019, the assessment included, in addition to the 14 regular item blocks, three blocks per subject per grade of newly designed problem solving and inquiry tasks, known as PSIs, especially developed to take advantage of the new digital environment. The PSIs simulate real-world and laboratory situations where students can integrate and apply process skills and content knowledge to solve mathematics problems and conduct scientific experiments or investigations. The demanding criteria for the PSIs made them very difficult and resource intensive to develop. Special teams of consultants as well as the TIMSS 2019 SMIRC members collaborated virtually and in meetings to develop tasks that (1) assess mathematics and science (not reading or perseverance), (2) take advantage of the "e" environment, and (3) are engaging and motivating for students. The PSI tasks – such as designing a building or studying plants' growing conditions – involved visually attractive, interactive scenarios that presented students with adaptive and responsive ways to follow a series of steps toward a solution. There also was an opportunity to digitally track students' problem solving or inquiry paths through the PSIs. Studying the process data about which student approaches were successful or unsuccessful in solving problems may provide information to help improve instruction.

The TIMSS 2019 eTIMSS countries participated in an ambitious pilot study. Then, for the 2019 assessment, blocks containing only the PSIs were rotated through the 14-booklet design of regular items, resulting in a separate sample of students responding to the PSIs. For the half of the countries that made the transition to eTIMSS, the results from the PSI tasks can be analyzed and reported separately, keeping the data from the 14-booklet design equivalent for the eTIMSS and
paperTIMSS countries and enabling trend results for all countries in TIMSS 2019. After the TIMSS 2019 results are released, the PSIs can be analyzed one-by-one and scaled together with the regular TIMSS items to report the results and inform future plans. As part of designing the fully digital TIMSS 2023, the role of the PSIs can be decided so that the PSIs contribute in the way most consistent with countries’ mathematics and science curricula and instruction.

Methodology for Linking Successive Assessments to a Common Scale

TIMSS and PIRLS use a complex system of item response theory (IRT) scaling with latent regression to transform student responses to the achievement items into proficiency estimates on the TIMSS and PIRLS achievement scales (Foy & Yin, 2016, 2017; Martin & Mullis, 2019). The general procedure was developed by ETS for the US National Assessment of Educational Progress (NAEP), and its use in international large-scale assessments such as TIMSS and PIRLS is described in von Davier and Sinharay (2014). Its application to linking successive assessments to a common scale is described in Mazzeo and von Davier (2014), which also informed the treatment of the process in this chapter. The analyses involved in linking successive assessments are conducted in four separate stages – item calibration through IRT scaling, population modeling with latent regression, student proficiency estimation with plausible values, and scale transformation to reporting scales.

Item Calibration, Conditioning, and Plausible Values

In the item calibration stage, TIMSS and PIRLS use two- and three-parameter logistic and generalized partial-credit IRT models to derive, for each achievement item, item parameters that characterize its properties, such as difficulty, discriminating power, and the impact of guessing (for multiple-choice items). These item parameters allow blocks of items of varying difficulty and discrimination to be combined in a single assessment and reported on a common scale, and they are a key element of the TIMSS and PIRLS reporting procedure. Through TIMSS 2019, item calibration was conducted using the PARSCALE software (Muraki & Bock, 1997).

In the population modeling/latent regression stage, the item parameters from the first stage are combined with students' item responses and available demographic and other background data in a latent regression with student proficiency as the latent (unobserved) dependent variable. This process, also known as conditioning, provides more reliable proficiency estimates than using item parameters and student responses alone.

In the third, plausible value stage, the latent regression coefficients, student responses, and background variables are used to provide a set of plausible values (proficiency estimates) for each student. The plausible values are random draws from the estimated posterior distribution of student proficiency given the item responses,
background variables, and regression coefficients estimated in the previous step. Assessments through TIMSS 2019 and PIRLS 2016 used ETS’s DGROUP program (Rogers et al., 2006) to generate IRT proficiency estimates. In the final, scale transformation stage, the plausible values undergo linear transformation to the metric used by the assessments to report results – mathematics and science for TIMSS and reading for PIRLS. The TIMSS reporting metric was established in TIMSS 1995, the first TIMSS assessment cycle, by taking the mean and standard deviation of the plausible values across all participating countries for each subject as points of reference. Applying a linear transformation, the international mean was transformed to 500 points on the reporting metric and the international standard deviation to 100 points for each subject and grade level. A similar approach was taken to establishing the PIRLS reading scale in PIRLS 2001, the first PIRLS assessment. In subsequent TIMSS and PIRLS assessment cycles, the transformation was derived so that the current assessment results are expressed in the reporting metric established by the first TIMSS or PIRLS assessment cycle.
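As a concrete illustration of two pieces of this machinery – an IRT item response function and the establishment of the 500/100 reporting metric – the sketch below uses made-up item parameters and a simulated proficiency distribution. It is a simplified stand-in for the operational procedure, which relies on specialized software (e.g., PARSCALE and DGROUP) rather than code like this.

```python
import numpy as np

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) model: probability of a correct response given
    proficiency theta, discrimination a, difficulty b, and pseudo-guessing c.
    The 1.7 is the conventional scaling constant used with the logistic form."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical item parameters (not actual TIMSS or PIRLS values).
theta = np.linspace(-3, 3, 7)
print(np.round(p_correct_3pl(theta, a=1.0, b=0.2, c=0.20), 2))

# Establishing a reporting metric: linearly transform proficiency estimates so the
# international mean maps to 500 points and the standard deviation to 100 points.
rng = np.random.default_rng(0)
proficiency = rng.normal(loc=0.1, scale=0.9, size=50_000)   # simulated latent scores
reported = 500.0 + 100.0 * (proficiency - proficiency.mean()) / proficiency.std()
print(round(reported.mean()), round(reported.std()))         # 500 100
```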

Concurrent Calibration

To enable TIMSS to measure trends over time, achievement data from successive TIMSS assessments are transformed to the same mathematics and science achievement scales originally established in 1995. This is done by concurrently scaling the data from each assessment together with the data from the previous assessment – a process known as concurrent calibration – and applying linear transformations to place the results from the current assessment on the same scale as the results from the previous assessment. In concurrent calibration, item parameters for the current assessment are estimated based on the data from both the current and the previous assessment, recognizing that a large number of items (the trend items) are common to both. It is then possible to estimate the latent ability distributions of students in both assessments using the item parameters from the concurrent calibration. The difference between these two distributions is the change in achievement from one assessment to the next.

The stability of the concurrent calibration linking depends to a considerable extent on having a substantial number of trend items that are retained from one assessment to the next. For example, TIMSS 2015 had eight blocks of items in common with TIMSS 2011 for each subject and grade, such that the TIMSS 2011 and 2015 concurrent scaling of fourth-grade mathematics involved 20 distinct blocks of items: 14 item blocks from 2011 and 14 item blocks from 2015, with 8 blocks (containing 102 items) common to both assessments.

Figure 4 illustrates how the concurrent calibration approach is applied in the context of TIMSS trend scaling.

Fig. 4 TIMSS concurrent calibration model

The gap between the distributions of the previous assessment data under the previous calibration and under the concurrent calibration (Fig. 4, second panel) is typically small and is the result of slight differences in the item parameter estimates from the two calibrations. The linear transformation removes this gap by shifting the two distributions from the concurrent calibration, such that the distribution of the previous assessment data from the concurrent
calibration aligns with the distribution of the previous assessment data from the previous calibration, while preserving the gap between the previous and current assessment data under the concurrent calibration. This latter gap represents the change in achievement between the previous and current assessments that TIMSS sets out to measure as trend.

After the item parameters have been estimated during the item calibration phase, the next step is to conduct a latent regression analysis in which the items with their item parameters are treated as indicators of the latent ability and the student background variables are treated as covariates. The plausible values generated by TIMSS are multiple imputations from this latent regression model based on the students' responses to the items they were given, the item parameters estimated in the calibration stage, and the students' background characteristics. Because the plausible values generated from the latent regression analysis are conditional on the student background data, the background variables collectively are known as the conditioning model. To provide a mechanism for estimating the uncertainty due to the item sampling process, and following the practice first established by NAEP, TIMSS generates five plausible values for each student on each TIMSS achievement scale.

To provide results for the current assessment on the established TIMSS achievement scales, the plausible values for the mathematics and science scales generated by DGROUP are transformed to the TIMSS reporting metric. This is accomplished through a set of linear transformations as part of the concurrent calibration procedure. These linear transformations were given by

$$PV^{*}_{k,i} = A_{k,i} + B_{k,i} \cdot PV_{k,i}$$

where $PV_{k,i}$ is plausible value $i$ of scale $k$ prior to transformation, and $PV^{*}_{k,i}$ is plausible value $i$ of scale $k$ after transformation.
$A_{k,i}$ and $B_{k,i}$ are linear transformation constants. The linear transformation constants are obtained by first computing the international means and standard deviations of the proficiency scores for the overall mathematics and science scales using the plausible values produced in the previous assessment, based on the previous item calibrations, for the trend countries (countries that participated in both the previous and the current assessment). Next, the same calculations are carried out using the plausible values from the rescaled previous assessment data, based on the concurrent item calibrations, for the same set of countries. From these calculations, the linear transformation constants are defined as

$$B_{k,i} = \sigma_{k,i} / \sigma^{*}_{k,i}$$
$$A_{k,i} = \mu_{k,i} - B_{k,i} \cdot \mu^{*}_{k,i}$$

where $\mu_{k,i}$ is the international mean of scale $k$ based on plausible value $i$ for the previous assessment, $\mu^{*}_{k,i}$ is the international mean of scale $k$ based on plausible value $i$ from the previous assessment based on the concurrent calibration, $\sigma_{k,i}$ is the international standard deviation of scale $k$ based on plausible value $i$ for the previous assessment, and $\sigma^{*}_{k,i}$ is the international standard deviation of scale $k$ based on plausible value $i$ from the previous assessment based on the concurrent calibration.

There are five sets of transformation constants for each scale, one for each plausible value. These linear transformation constants are applied to the overall proficiency scores – mathematics and science – at both grades and for all participating countries and benchmarking participants. This provides student proficiency scores for the current assessment that are directly comparable to the proficiency scores from previous TIMSS assessments.
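A short worked example of these constants follows. The formulas are exactly those defined above, but the means and standard deviations are invented numbers used only to show how the transformation maps a plausible value from the concurrent-calibration metric onto the established reporting metric.

```python
# Worked example of the linear transformation constants defined above.
# All numbers are invented for illustration; the actual values come from the
# international means and SDs computed over the trend countries.

mu_prev        = 500.0   # mean, previous assessment, previous calibration
sigma_prev     = 100.0   # SD, previous assessment, previous calibration
mu_rescaled    = 0.05    # mean, previous assessment rescaled under the
sigma_rescaled = 0.98    # concurrent calibration (logit-like metric)

B = sigma_prev / sigma_rescaled      # B_{k,i} = sigma_{k,i} / sigma*_{k,i}
A = mu_prev - B * mu_rescaled        # A_{k,i} = mu_{k,i} - B_{k,i} * mu*_{k,i}

def to_reporting_metric(pv):
    """Apply PV*_{k,i} = A_{k,i} + B_{k,i} * PV_{k,i} to one plausible value."""
    return A + B * pv

# A plausible value at the rescaled previous-assessment mean is mapped back onto
# that assessment's originally reported mean; one rescaled SD higher maps 100
# points higher on the reporting metric.
print(round(to_reporting_metric(mu_rescaled), 1))                    # 500.0
print(round(to_reporting_metric(mu_rescaled + sigma_rescaled), 1))   # 600.0
```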

Bridges When There Are Major Changes

Sometimes it is necessary to make a major change from one assessment cycle to the next without delay. In these situations, it is important to include a bridge to the previous assessment by administering the "old" assessment to a representative sample of students together with the new assessment. The analysis of the bridge data can be very complicated, but it is the best way to link previous procedures and methods to more appropriate approaches and move into the future.

TIMSS 2007 Bridge

TIMSS conducted its first bridging study to link TIMSS 2007 back to TIMSS 2003 (Foy et al., 2008). TIMSS 2003 data made it apparent that not all students had
sufficient time to complete their assessment booklets, so a new booklet design was introduced in TIMSS 2007 that provided more time for students to respond to the items. Unlike the TIMSS 2003 booklets, which each contained six blocks of items, the TIMSS 2007 booklets each comprised just four same-sized blocks to be completed in the same amount of time (i.e., 72 min at the fourth grade and 90 min at the eighth grade). Concerned that the 2007 assessment booklets might be less difficult because students had more time, TIMSS implemented a "bridging study" to see if this was indeed the case.

The TIMSS 2007 bridging study involved the administration of a subset of the TIMSS 2003 assessment booklets in 2007 to establish a bridge between the 2003 and 2007 assessments. Four of the 2003 booklets were inserted into the rotation of the fourteen 2007 assessment booklets, such that the four bridge booklets were administered alongside the TIMSS 2007 assessment booklets to randomly equivalent samples of students in all trend countries (countries that participated in both TIMSS 2003 and TIMSS 2007). All item blocks in the bridge booklets had been part of the TIMSS 2003 assessment, and four mathematics and four science blocks in the bridge booklets (at each grade) also were included in the TIMSS 2007 assessment booklets. Presenting the same items in both the 2007 bridge booklets and the 2007 assessment booklets allowed TIMSS to isolate the effect of changing the booklet design and provided enough data to adjust for this effect. The data from the bridging study showed a slight decrease in the difficulty of the achievement items between 2003 and 2007 and provided a basis for maintaining the measurement of trends.

TIMSS 2019 Bridge

In 2019, TIMSS began the transition to computer-based assessment by introducing a computerized version known as eTIMSS. Half the participating countries in 2019 chose to administer the new eTIMSS version, while the other half retained the traditional paper-based administration – referred to as paperTIMSS. Every effort was made to have the eTIMSS and paperTIMSS assessments as similar as possible, while capitalizing on new item types such as drag-and-drop and drop-down menus, and automated scoring through keypad entry. The goal in item development was to maximize the comparability of eTIMSS and paperTIMSS by ensuring the two versions of the assessment measured the same mathematics and science constructs using the same items as much as possible. Despite these efforts, however, given the likelihood of some degree of mode effect between eTIMSS and paperTIMSS, eTIMSS included a bridge to paperTIMSS.

To provide bridging data, eTIMSS countries also administered the paperTIMSS booklets consisting of the eight blocks of trend items to an additional subsample of 1,500 students, sampling from the same schools as the full eTIMSS samples to the extent possible. As samples from eTIMSS countries taking items in paperTIMSS format, the bridge data form an intermediate link (or bridge) between eTIMSS countries taking eTIMSS items and paperTIMSS countries taking items in paper format.

The major challenge in scaling the 2019 data was linking the eTIMSS and paperTIMSS data to the existing TIMSS achievement scales so that the achievement results are comparable, while maintaining trends from previous TIMSS assessments. The scaling of the TIMSS 2019 achievement data proceeded in two phases: calibrating the paperTIMSS and bridge data on the TIMSS achievement scales, and linking the eTIMSS data to these scales. The bridge data played a crucial role in each phase.

Phase 1: Item Calibration for paperTIMSS and Bridge Data

The first phase was to link the TIMSS 2019 achievement data to the TIMSS achievement scales using the paperTIMSS items, combining the paperTIMSS countries' data and the bridge data from the eTIMSS countries. This was accomplished through a concurrent calibration of the TIMSS 2019 paperTIMSS and eTIMSS bridge data together with the TIMSS 2015 data from those countries that participated in both the 2015 and 2019 assessments. The 2019 concurrent calibration model provided item parameter estimates in a common metric for all paperTIMSS items, which then were used to reestimate the TIMSS 2015 ability distribution across all trend countries and to derive the linear transformation aligning this reestimated 2015 ability distribution with the original ability distribution previously estimated in 2015. Including the bridge data ensured that the eTIMSS countries as well as the paperTIMSS countries contributed to the concurrent calibration and scale estimation. The linear transformation, applied to the 2019 paperTIMSS data and to the eTIMSS bridge data, produced proficiency estimates for all countries on the TIMSS achievement scales. For the paperTIMSS countries, these are the TIMSS 2019 proficiency scores that are reported and published, whereas for eTIMSS countries they are estimates of how they would have performed on paperTIMSS based on their bridge data, and they are instrumental in linking to the eTIMSS data in the next phase.

Phase 2: Linking eTIMSS Data to TIMSS Trend Scales

The purpose of Phase 2 was to link the eTIMSS assessment data to the TIMSS achievement scale established for the paperTIMSS data in Phase 1, while adjusting for any mode effects between the two assessment versions. This is possible because, while the eTIMSS data are the result of administering the eTIMSS items in the eTIMSS countries, the eTIMSS bridge data resulted from administering a large subset of those same items in paper format in the same countries. Calibrating the eTIMSS data together with the eTIMSS bridge data, while holding the item parameters for the bridge data fixed at their values from Phase 1, allows for a comparison of the psychometric properties of the items under both modes – paperTIMSS and eTIMSS. Because the eTIMSS and paperTIMSS versions of the items were designed to be as similar as possible, a large percentage (approximately 80%) were found to be psychometrically equivalent ("invariant") between eTIMSS and paperTIMSS, with just a small international adjustment constant to the difficulty parameter. As a result, item parameters for such invariant eTIMSS items could be fixed at their Phase 1 values for their paperTIMSS counterparts, but estimated from the eTIMSS data
only for the 20% found to be non-invariant. This procedure had the effect of capitalizing on the existence of a substantial proportion of invariant items to provide a good fit to the data and establish a solid link between the eTIMSS and paperTIMSS assessments, while allowing the non-invariant items to behave differently in eTIMSS and paperTIMSS. Full descriptions of the TIMSS 2019 scaling can be found in von Davier (2020), Foy et al. (2020), and von Davier et al. (2020).
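To make the Phase 2 logic concrete, here is a minimal sketch, not the operational TIMSS code, of how item parameters for the eTIMSS calibration could be assembled once invariant items have been identified. The function name, the two-parameter (discrimination, difficulty) representation, and the single mode_shift constant are illustrative assumptions; the actual models and estimation are described in von Davier (2020) and Foy et al. (2020).

def assemble_etimss_item_parameters(paper_params, etimss_estimates, invariant_items, mode_shift=0.0):
    """Schematic assembly of eTIMSS item parameters: invariant items keep their
    fixed Phase 1 (paper) parameters, here with an optional common adjustment to
    the difficulty parameter, while non-invariant items use parameters estimated
    freely from the eTIMSS data.

    paper_params / etimss_estimates: dicts mapping item id -> (discrimination, difficulty)
    invariant_items: set of item ids judged psychometrically equivalent across modes
    """
    assembled = {}
    for item, (a, b) in paper_params.items():
        if item in invariant_items:
            assembled[item] = (a, b + mode_shift)      # fixed at Phase 1 values
        else:
            assembled[item] = etimss_estimates[item]   # freely estimated from eTIMSS data
    return assembled

# Hypothetical usage: item "M01" is treated as invariant, "M02" is not.
params = assemble_etimss_item_parameters(
    paper_params={"M01": (1.2, -0.3), "M02": (0.9, 0.5)},
    etimss_estimates={"M02": (1.0, 0.8)},
    invariant_items={"M01"},
)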

Conclusion/Summary

Countries around the world value reliable data about long-term trends in educational achievement as a way to monitor the effectiveness of their educational systems. Despite this widespread interest, however, producing such data is extremely complicated. Simply giving the same assessment over and over does not work, primarily because the world keeps changing: the content of the assessment instruments becomes dated and the assessment methods become outmoded. Technology, in particular, is changing at an ever-increasing speed, making it obvious when assessment programs do not keep up with the times. Regularly reporting robust trend measures while keeping the assessments forward looking therefore requires careful planning and sound methodology.

PIRLS was designed to be a trend assessment from its inception in 2001, and so was TIMSS beginning with TIMSS 2003. The assessment designs for both TIMSS and PIRLS have at times been modified across the past two decades, but the central idea has remained constant: they should be evolving assessments with the twin goals of looking back to measure trends and looking forward to reflect cutting-edge assessment methods. Currently, PIRLS 2021 and TIMSS 2023 are based on the same item block design containing 18 blocks. Each assessment cycle, 6 item blocks are retired and replaced with newly developed assessment items, so that each assessment comprises 2/3 trend items and 1/3 newly developed items. Every assessment also has a large number of items (about 150 to 250), so that the trend is based on a stable measure and there are enough new items to introduce new content and item formats.

Consistent with this assessment design, which accommodates a sizeable proportion of newly developed items (1/3), TIMSS and PIRLS gradually update the assessment frameworks with each assessment cycle. The broad content areas (e.g., number, algebra, and geometry in mathematics; biology, physics, and chemistry in science) cannot be changed by more than 5% of the assessment weighting from cycle to cycle. However, the assessment topics within the broad areas (each of which receives equal weighting) can be updated. As the third aspect of the trend plan, together with the evolving assessment design that requires updating the assessment items each cycle, TIMSS and PIRLS use concurrent scaling of adjacent assessments to link the assessments from one cycle to the next. The TIMSS and PIRLS achievement scales were originally created based on the achievement results of the 30–40 countries that participated in 1995 and 2001, respectively. Since then, with each new assessment cycle, the data for all the countries in the recent and previous assessments are scaled together, providing item parameters for the recent assessment and enabling the results for each successive assessment to be placed onto the TIMSS and PIRLS achievement scales.

Using this approach to develop and report the results for each ensuing assessment cycle (an updated framework, substantial numbers of newly developed items, and concurrent scaling to link adjacent assessments) allows the TIMSS and PIRLS assessments to evolve gradually. However, sometimes more radical or pervasive changes are necessary to maintain relevance and cutting-edge measurement in the next assessment cycle. One way is to begin assessing new, promising areas of learning in mathematics, science, and reading, or to trial new methods, on a small scale as a separate effort alongside the ongoing assessment. For example, PIRLS inaugurated the ePIRLS computer-based assessment of online reading together with the entirely paper-and-pencil-based PIRLS 2016. Fourteen countries participated in ePIRLS 2016, which involved assessing a new reading construct (online reading) and a new response mode (computer-based assessment). In PIRLS 2021, 30 countries (half the participating countries) are administering ePIRLS, and ePIRLS will then become an integral part of PIRLS 2026. Similarly, the TIMSS problem-solving and inquiry tasks (PSIs) were introduced with eTIMSS 2019 in half the countries, with an eye to PSIs becoming an integral part of TIMSS 2023.

Lastly, an assessment can introduce new methods on a faster schedule if a bridge is built to the past. That is, at the same time as the new assessment using the new methods is administered, an appropriately sized sample of students in the same schools (or at least the same target population) also takes the assessment using the previously assessed instruments and old methods. The difference in the results between the new and previous assessment methods can be used either to confirm that the trend results are the same for both methods or to make adjustments so that the new methods can continue to be used to measure trends into the future.
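The bridge logic described in the final paragraph can be illustrated with a small sketch. Everything here (the function name, the use of simple weighted means, and the idea of a country-level comparison) is a hypothetical simplification of the country-level mode-of-administration analyses reported in von Davier et al. (2020), not the operational procedure.

import numpy as np

def country_mode_effect(new_mode_scores, bridge_scores, new_weights, bridge_weights):
    """Difference between a country's results under the new assessment methods and
    under the previous (bridge) methods, both administered in the same cycle.
    A value near zero suggests the trend measure is unaffected by the change;
    a larger value could motivate an adjustment before reporting trends."""
    new_mean = np.average(new_mode_scores, weights=new_weights)
    bridge_mean = np.average(bridge_scores, weights=bridge_weights)
    return new_mean - bridge_mean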

References

Centurino, V. A. S., & Jones, L. R. (2017). TIMSS 2019 science framework. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2019 assessment frameworks (pp. 13-25). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2019/frameworks/

Foy, P., Fishbein, B., von Davier, M., & Yin, L. (2020). Implementing the TIMSS 2019 scaling methodology. In M. O. Martin, M. von Davier, & I. V. S. Mullis (Eds.), Methods and procedures: TIMSS 2019 technical report (pp. 12.1-12.146). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: https://timssandpirls.bc.edu/timss2019/methods/chapter-12.html

Foy, P., Galia, J., & Li, I. (2008). Scaling the data from the TIMSS 2007 mathematics and science assessments. In J. F. Olson, M. O. Martin, & I. V. S. Mullis (Eds.), TIMSS 2007 technical report (pp. 225-279). TIMSS & PIRLS International Study Center, Boston College.

Foy, P., & Yin, L. (2016). Scaling the TIMSS 2015 achievement data. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in TIMSS 2015 (pp. 13.1-13.62). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timss.bc.edu/publications/timss/2015-methods/chapter-13.html

Foy, P., & Yin, Y. (2017). Scaling the PIRLS 2016 achievement data. In M. O. Martin, I. V. S. Mullis, & M. Hooper (Eds.), Methods and procedures in PIRLS 2016 (pp. 12.1-12.38). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: https://timssandpirls.bc.edu/publications/pirls/2016-methods/chapter-12.html


Khorramdel, L., von Davier, M., & Yamamoto, K. (Eds.). (in press). Innovative computer-based international large-scale assessments. Springer.

Lindquist, M., Philpot, R., Mullis, I. V. S., & Cotter, K. E. (2017). TIMSS 2019 mathematics framework. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2019 assessment frameworks (pp. 13-25). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2019/frameworks/

Martin, M. O., & Mullis, I. V. S. (2019). TIMSS 2015: Illustrating advancements in large-scale international assessments. Journal of Educational and Behavioral Statistics, 44(6), 752-781. https://doi.org/10.3102/1076998619882030

Martin, M. O., Mullis, I. V. S., & Foy, P. (2013). The limits of measurement: Problems in measuring trends in student achievement for low-performing countries. In N. McElvany & H. G. Holtappels (Eds.), Festschrift, Prof. Dr. Wilfried Bos, Studien der Empirischen Bildungsforschung – Befunde und Perspektiven [Festschrift for Prof. Dr. Wilfried Bos, studies of empirical educational research – findings and perspectives]. Waxmann.

Martin, M. O., Mullis, I. V. S., & Foy, P. (2017). TIMSS 2019 assessment design. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2019 assessment frameworks (pp. 81-91). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2019/frameworks/

Martin, M. O., von Davier, M., Foy, P., & Mullis, I. V. S. (2019). PIRLS 2021 assessment design. In I. V. S. Mullis & M. O. Martin (Eds.), PIRLS 2021 assessment frameworks. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://pirls2021.org/frameworks

Mazzeo, J., & von Davier, M. (2014). Linking scales in international large-scale assessments. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis. CRC Press.

Mullis, I. V. S., & Martin, M. O. (Eds.). (2017). TIMSS 2019 assessment frameworks. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2019/frameworks/

Mullis, I. V. S., & Martin, M. O. (2019). PIRLS 2021 reading assessment framework. In I. V. S. Mullis & M. O. Martin (Eds.), PIRLS 2021 assessment frameworks (pp. 5-25). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://pirls2021.org/wp-content/uploads/sites/2/2019/04/P21_Frameworks.pdf

Mullis, I. V. S., Martin, M. O., Goh, S., & Cotter, K. (Eds.). (2016). TIMSS 2015 encyclopedia: Education policy and curriculum in mathematics and science. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/timss2015/encyclopedia/

Mullis, I. V. S., Martin, M. O., Goh, S., & Prendergast, C. (Eds.). (2017). PIRLS 2016 encyclopedia: Education policy and curriculum in reading. Retrieved from Boston College, TIMSS & PIRLS International Study Center website: http://timssandpirls.bc.edu/pirls2016/encyclopedia/

Muraki, E., & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating scale data [Computer software]. Scientific Software, Inc.

Rogers, A., Tang, C., Lin, M.-J., & Kandathil, M. (2006). D-Group [Computer software]. Educational Testing Service.

von Davier, M. (2020). TIMSS 2019 scaling methodology: Item response theory, population models, and linking across modes. In M. O. Martin, M. von Davier, & I. V. S. Mullis (Eds.), Methods and procedures: TIMSS 2019 technical report (pp. 11.1-11.25). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: https://timssandpirls.bc.edu/timss2019/methods/chapter-11.html

von Davier, M., Foy, P., Martin, M. O., & Mullis, I. V. S. (2020). Examining eTIMSS country differences between eTIMSS data and bridge data: A look at country-level mode of administration effects. In M. O. Martin, M. von Davier, & I. V. S. Mullis (Eds.), Methods and procedures: TIMSS 2019 technical report (pp. 13.1-13.24). Retrieved from Boston College, TIMSS & PIRLS International Study Center website: https://timssandpirls.bc.edu/timss2019/methods/chapter-13.html

von Davier, M., & Sinharay, S. (2014). Analytics in international large-scale assessments: Item response theory and population models. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis. CRC Press.

IEA’s Teacher Education and Development Study in Mathematics (TEDS-M)

15

Framework and Findings from 17 Countries Sigrid Blo¨meke

Contents
Introduction
The Need for a Teacher ILSA
Structure of the Present Chapter
Conceptual Framework
Macro Level: Countries' National Context
Meso Level: Institutional Context
Microlevel: Future Teachers' Background
Teacher Education Outcomes: Teachers' Professional Knowledge
Teacher Education Outcomes: Teachers' Professional Beliefs
Research Questions
Study Design of TEDS-M
Sampling
Instruments: National (Macro Level) and Institutional Context (Meso Level)
Instruments: Future Teacher Surveys (Microlevel)
Data Analysis
Results
Macro Level: Structure and Quality Assurance of Teacher Education
Meso Level: Characteristics of Teacher Educators and Opportunities to Learn
Microlevel: Characteristics of Future Teachers (Based on Blömeke and Kaiser, 2014)
Outcomes of Teacher Education
Further Developments
Conclusions
References


S. Blömeke (*)
University of Oslo, Oslo, Norway
e-mail: [email protected]

© Springer Nature Switzerland AG 2022
T. Nilsen et al. (eds.), International Handbook of Comparative Large-Scale Studies in Education, Springer International Handbooks of Education, https://doi.org/10.1007/978-3-030-88178-8_16


Abstract

TEDS-M (Teacher Education and Development Study in Mathematics) was carried out in 17 countries by the International Association for the Evaluation of Educational Achievement (IEA). It was the first international large-scale assessment in which primary and lower-secondary mathematics teachers at the end of their teacher education (future teachers) were tested on what they knew and believed. The data revealed that most countries offered two pathways to a teaching license: a 4-year specialized teacher education program concluding with a bachelor degree and a longer pathway starting with a subject-specific bachelor program followed by a specialized teacher education program. Taiwan achieved the best TEDS-M results for mathematics content knowledge (MCK) and mathematics pedagogical content knowledge (MPCK) of future primary (together with Singapore) and lower-secondary teachers. In contrast, MCK and MPCK of future teachers from Botswana, Chile, Georgia, and the Philippines were lowest. In high-achieving TEDS-M countries, the level of mathematics classes offered and the emphasis on teaching practice were higher than elsewhere. A typical future primary and lower-secondary teacher was female and 24 years old at the end of teacher education. There were 26–100 books in her parents' home, and she had good grades in high school. Intrinsic-pedagogical motives had been decisive for becoming a teacher. Teacher education outcomes were not an exclusive result of countries' sociocultural characteristics, individual teacher characteristics, or the structure of a teacher education program but a complex amalgam of these characteristics. TEDS-M triggered many follow-up studies that took place in new countries (e.g., Peru), new domains (e.g., future language teachers), or new career stages (e.g., in-service teachers).

Keywords

International large-scale assessment · Teacher knowledge · Mathematics teacher · Teacher beliefs · Opportunities to learn

Introduction

With the publication of results from international large-scale assessments (ILSAs) about student achievement, predictors of this achievement that are malleable by educational policymakers have become areas of considerable research interest. The objective is to identify ways to improve the outcomes of schooling, not only cognitive but also affective-motivational ones, because these are regarded as relevant for long-term success in life and on the labor market.

Teachers' competence is such a predictor of student outcomes and has thus become an important research area. While the benefits or limitations related to students' family background are difficult for educational policymakers to change, teachers' knowledge and skills, and the instructional quality they implement, can be developed through initial teacher education and continuous professional development. At the same time, teachers' competence explains substantial variance in students' cognitive (Chetty et al., 2014; Hill et al., 2005) and noncognitive characteristics (Blazar & Kraft, 2017; Blömeke & Olsen, 2019).

Exactly how to design teacher education and professional development, and which effects they have on teacher competence, had not been examined on a large scale and across countries until about 10 years ago. This research gap made a new type of ILSA targeting teacher competence particularly valuable. In 2010, results were released from the "Teacher Education and Development Study in Mathematics" (TEDS-M), carried out under the supervision of the International Association for the Evaluation of Educational Achievement (IEA) (Tatto et al., 2008). Seventeen countries took part in this first ILSA in which the target population was not students, as in TIMSS or PISA, but teachers, who were tested in a standardized way on what they knew and believed at the end of teacher education. The target populations were primary and lower-secondary mathematics teachers who would soon receive a license to teach mathematics, either in one of the grades 1 through 4 (TEDS-M PRIMARY) or in grade 8 (TEDS-M SECONDARY). These two target populations are called future teachers (FTs) in the following.

The Need for a Teacher ILSA

Mathematics teachers have a central role in the preparation of the next generations of students. Mathematics not only belongs to the core school subjects worldwide (Mullis et al., 2008) but is also essential for meeting everyday occupational requirements (Freudenthal, 1983). Comparing mathematics teachers' competence across countries in terms of their knowledge and beliefs can therefore be regarded as crucial, because it targets one of the most important malleable determinants of the instructional quality these teachers will implement. Teaching mathematics represents an important part of all primary teachers' responsibilities, since they work as class teachers teaching most subjects, including mathematics, during the first years of schooling. Similarly, teaching mathematics represents an important part of lower-secondary teachers' responsibilities, as they teach mathematics as subject specialists during the following years of schooling.

TEDS-M was not only the first ILSA of teachers but also the first ILSA of higher education outcomes across countries using representative samples. Besides TEDS-M, there has so far been only one other ILSA that tried to implement standardized assessments of higher education outcomes across countries, namely the "Assessment of Higher Education Learning Outcomes" (AHELO) organized by the Organisation for Economic Co-operation and Development (OECD). AHELO was a feasibility study and was therefore carried out with convenience samples only (i.e., without drawing representative samples; Tremblay et al., 2012). Seventeen countries took part and implemented assessments in economics and civil engineering. The samples were small, and the OECD decided not to follow up the idea of assessing higher education outcomes due to a broad range of conceptual and methodological challenges (OECD, 2013).

When IEA decided to launch TEDS-M, no measures existed for the dimensions and facets supposed to constitute teacher competence. Standardized instruments assessing teachers' knowledge and their beliefs therefore had to be developed from scratch. Cross-national data about teachers' knowledge from any type of standardized testing were almost nonexistent as well (Blömeke, 2004; Cochran-Smith & Zeichner, 2005). OECD's Teaching and Learning International Survey (TALIS), which also targeted teachers, included self-reported data only. Other research studies that included standardized assessments of teacher knowledge (e.g., Baumert et al., 2010; Hill et al., 2008) had taken place in one country only or did not scale data from different countries together, which made it difficult to compare results across countries. Other studies were small scale (Schmidt et al., 2011), qualitative, or conducted within the researchers' own teacher education institutions only (Lerman, 2001; Adler et al., 2005).

The same research deficits applied to the questions of how to design teacher education and which opportunities to learn (OTL) to offer. Only crude data existed about the effects of different components of teacher education on teacher competence, and these had led to inconsistent results (Cochran-Smith & Zeichner, 2005). In many studies, only the type of teaching license or the number of courses taken during a teacher education program had been used to define OTL. These measures reflected the amount of content coverage without considering which content was offered, thereby ignoring qualitative similarities or differences between countries or teacher education programs. Data from a small comparative study on lower-secondary mathematics teacher education in six countries (Schmidt et al., 2011) had revealed that specific OTL profiles may exist per country and that these may be influenced by context characteristics: in five of the countries examined, the multiple institutions where teacher education took place tended to cluster together with respect to the OTL offered, suggesting country-level agreement and between-country heterogeneity reflecting cultural effects (Schmidt et al., 2008).

Studying teacher education in an international context is a challenge. Differences in the structure of study programs and in the relevance attached to different aspects of the curriculum carry the risk that the data gathered may not be comparable across countries (Akiba et al., 2007). At the same time, such differences in the structure and curriculum of teacher education make international comparative research valuable, because they make hidden national assumptions visible about both what teachers are supposed to know and be able to do and how teacher competence is supposed to develop. Like the water in the fish's tank, such cultural givens are too often invisible (Blömeke & Paine, 2008). Researchers are embedded in their own culture and are thus not necessarily able to recognize its specifics (Schmidt et al., 1996). Developing an ILSA calls into question many matters that may remain unquestioned in national studies. Teacher education may be organized very differently than a researcher is used to, and international comparisons provide the chance to move beyond the familiar and to perceive its characteristics with a kind of "peripheral vision" (Bateson, 1994). The results of a comparative study about teacher competence also provide benchmarks of what competence level and quality can be achieved during teacher education, and they may point to country-specific strengths and weaknesses.
In many countries, the results of ILSAs on student achievement had led to fundamental reforms of the school system. In Germany, for example, the publication of the PISA 2000 results (one of the first international studies the country took part in) and the realization that Germany performed at only a mediocre level, in contrast to the country's self-image, came as a shock. Heated debates among policymakers, researchers, and the media resulted in substantial changes to the curriculum, accountability measures, and in part even the structure of the school system (Waldow, 2009). Similarly, the USA implemented significant reforms in its mathematics school curricula after the so-called Sputnik shock and the country's weak performance in comparative studies such as SIMS (Pelgrum et al., 1986) and TIMSS (Mullis et al., 1997). Thus, TEDS-M, as an ILSA targeting teacher competence, provided the chance to rethink teacher education.

Structure of the Present Chapter

TEDS-M followed the IEA tradition, known from studies such as TIMSS, of connecting educational opportunity and educational achievement, in this case OTL during teacher education and teacher knowledge and beliefs at its end (Tatto et al., 2008). In line with the OTL concept (McDonnell, 1995), the basic idea of TEDS-M was to describe important aspects of the structure, content, quality, and amount of teacher education and to determine whether cross-national differences in future teachers' knowledge and beliefs at the end of their study programs were related to differences in these aspects. Opportunities to learn mathematics, mathematics pedagogy, and general pedagogy were identified as the main components of mathematics teacher education, both for future primary and for future lower-secondary teachers.

The first section below describes the conceptual framework underlying TEDS-M with respect to these three components (mathematics, mathematics pedagogy, and general pedagogy). The section thereafter describes the study design, the instruments, and the methodological challenges. Then, the core results of TEDS-M are presented. TEDS-M built on a pilot study in which many conceptual issues were clarified and first instruments were developed and tested. This study, the "Mathematics Teaching in the Twenty-First Century Study" (MT21), was carried out in six countries (Schmidt et al., 2011) and examined teacher education of lower-secondary mathematics teachers in Bulgaria, Germany, Mexico, South Korea, Taiwan, and the USA with convenience samples. Furthermore, TEDS-M triggered two new lines of research: one on other facets of teacher competence, such as teachers' skills to perceive, interpret, and make decisions, and another on higher education outcomes in general, beyond teacher education. These precursors and follow-up studies of TEDS-M are briefly summarized in the final section of this chapter.

Conceptual Framework

In order to examine systematically which factors may influence the development of future mathematics teachers' knowledge and beliefs during teacher education, the TEDS-M conceptual framework identified potentially influential characteristics on three levels: the national country context (macro level), the institutional characteristics of teacher education (meso level), and the individual background of future primary and lower-secondary teachers at the beginning of teacher education (microlevel). The meso and microlevels were hypothesized to partly mediate potential effects of the more distal macro level on teachers' knowledge and beliefs as the outcomes of teacher education (see Fig. 1).

Fig. 1 Hypothetical model of macro-, meso-, and microlevel characteristics influencing teacher knowledge and beliefs. The model relates the social, schooling, and policy context, the institutional characteristics of teacher education, the characteristics of teacher educators, and teachers' characteristics at the beginning of teacher education to teachers' knowledge and beliefs at the end of teacher education. (Based on Tatto et al., 2008)

Macro Level: Countries' National Context

Countries' social, schooling, and policy contexts were hypothesized to affect the development of teachers' knowledge and beliefs during teacher education. In this part, the TEDS-M framework included national policies related to teacher education and teachers, the structure of the school system, and the structure of teacher education in terms of study programs, quality assurance of teacher education and licensing, and the recruitment of future primary and lower-secondary teachers, including their selection into teacher education and later into the teaching profession (Tatto et al., 2008). Labor market issues, such as teachers' salaries, their status in society compared to other professions, retention in the teaching profession, and what it costs to train future mathematics teachers, also played major roles in the framework. Depending on where decisions about such structural macro-level characteristics were made within countries, information needed to be collected at several levels: at the national level in highly centralized countries, at the regional level in federal states, and/or at the local level in highly decentralized countries.

Since previous research had revealed that structural features, such as the type of degree received from a teacher education program, by themselves did not have direct effects on teachers' knowledge (Goldhaber & Liddle, 2011), this part of TEDS-M was first of all meant to provide the background information necessary to understand and interpret results from other parts of the study. In addition, macro-level characteristics were hypothesized to shape the institutional characteristics of teacher education within each country.

Meso Level: Institutional Context

According to the TEDS-M framework, the institutional characteristics of teacher education included the courses offered during a program but also the personnel involved and the resources available (Tatto et al., 2008). Teacher educators were hypothesized to be of crucial relevance for the development of future teachers' knowledge and beliefs because, just like teachers at school, they deliver the OTL during a program. Derived from educational effectiveness models (Scheerens & Blömeke, 2016), a broad range of teacher educator characteristics were hypothesized by the TEDS-M framework to be related to teachers' knowledge and beliefs, among others teacher educators' formal qualifications and previous job experiences as well as their beliefs, teaching objectives, and the teaching methods they used (see Fig. 1).

Future teachers' OTL were framed as content coverage, as was done in previous IEA studies on student achievement, specifically as "the content of what is being taught, the relative importance given to various aspects of mathematics and the student achievement relative to these priorities and content" (Travers & Westbury, 1989). OTL were in this sense defined as future primary and lower-secondary mathematics teachers encountering opportunities to learn about particular topics during teacher education. Since subject-matter specificity is the defining element of OTL (Schmidt et al., 1997), in the case of TEDS-M as a study about "learning to teach mathematics," the topics reflected mathematics and mathematics pedagogy besides general pedagogical topics. These fields were hypothesized to deliver the body of deep knowledge necessary for a teacher to present mathematics topics to students in a meaningful way and to connect the topics to one another as well as to students' prior knowledge (Clotfelter et al., 2006; Cochran-Smith & Zeichner, 2005).

Besides the content delivered, the TEDS-M framework hypothesized that the teaching methods experienced during teacher education would be relevant for the knowledge and beliefs developed (McDonnell, 1995). The idea of teacher education as a model for future teaching in class has always played an important role in pedagogical discourses, as became visible, for example, in the theory of "signature pedagogies" developed by Shulman (2005). Furthermore, empirical studies had revealed that teaching methods providing the opportunity to engage in teaching practices, such as planning a lesson or analyzing student work, were indicators of high-quality teacher education (Boyd et al., 2009).

OTL in teacher education can be regarded as having been intentionally developed by educational policymakers and teacher education institutions (Stark & Lattuca, 1997; Schmidt et al., 2008). Since the amount of time available during teacher education is limited, every choice provides some OTL at the expense of others and gives study programs a characteristic shape. National and program specifications of OTL were therefore assumed to reflect visions of what primary and lower-secondary mathematics teachers were supposed to know and to believe before they entered a classroom at school and of how teacher education programs should be organized to provide the knowledge and beliefs necessary for becoming a successful teacher (Darling-Hammond, 1996; Shulman, 1987).

Microlevel: Future Teachers' Background

In educational effectiveness studies, the data had revealed that school students' background was almost always a powerful predictor of their achievement. Specifically with respect to mathematics, background characteristics such as students' gender, socioeconomic status (SES), and language, as well as their prior knowledge and motivation, played important roles in explaining variance in achievement (Scheerens & Bosker, 1997). Similarly, the TEDS-M framework hypothesized that a range of teachers' individual background characteristics would be relevant for the development of knowledge and beliefs during mathematics teacher education (Tatto et al., 2008).

Mathematics has long been regarded as a male-dominated subject (Burton, 2001). Longitudinal and trend studies revealed that even though differential mathematics achievement by gender has decreased over the past decades, females still showed lower achievement than their male counterparts in mathematics tests at high school and college (Hyde et al., 2008). The MT21 study, mentioned above as one of the precursors of TEDS-M (Schmidt et al., 2011), provided first evidence that gender-related achievement differences in mathematics may apply to teacher education as well.

Early studies had already revealed that school students' SES was associated with their achievement (Coleman et al., 1966). SES represents access to resources important for learning, such as materials at home or parents' education (Mueller & Parcel, 1981). Language represents a similar resource. In many countries, students whose OTL at school occurred in a second language performed worse than first-language learners (Walter & Taskinen, 2008). The magnitude of language disadvantages in classroom discourse increased with the difference between the language skills sufficient for communication at home or with peers and the language proficiency necessary for educational success (Council of Chief State School Officers, 1990; Thomas & Collier, 1997). The TEDS-M framework therefore hypothesized such an effect for teacher education as well.

Finally, prior knowledge and motivation were hypothesized by the TEDS-M framework to affect the development of future teachers' knowledge and beliefs. Generic and domain-specific prior knowledge had frequently been proven to be strongly associated with school students' achievement (Simmons, 1995), so that not controlling for it would result in an overestimation of school effects (Goldhaber & Brewer, 1997). The same applied to students' motivation. Furthermore, studies revealed that motivation was related to cognitive student outcomes, especially if the learning tasks were complex (Benware & Deci, 1984; Grolnick & Ryan, 1987), as can be assumed to be the case during teacher education. Based on the state of research (Ryan & Deci, 2000), the TEDS-M framework distinguished between several dimensions of motives for becoming a teacher: intrinsic-pedagogical motives, intrinsic subject-related motives (Watt & Richardson, 2007), and extrinsic motives related to job security and job benefits (Brookhart & Freeman, 1992).

Teacher Education Outcomes: Teachers’ Professional Knowledge Based on Shulman’s seminal work (1985), the TEDS-M framework hypothesized that future primary and lower-secondary mathematics teachers would need to acquire professional knowledge in several domains to be able to successfully deal with their job tasks. The knowledge domains were related to mathematics in terms of content knowledge and pedagogical content knowledge and defined as follows (Tatto et al., 2008): 1. Content knowledge described future teachers’ mathematics content knowledge (MCK). MCK includes fundamental mathematical definitions, concepts, and procedures. 2. Pedagogical content knowledge described future teachers’ mathematics pedagogical content knowledge (MPCK). MPCK includes knowledge about how to plan lessons and how to present mathematical concepts to heterogeneous student groups in terms of cognitive ability, including learning difficulties, or background, including gender and ethnic or linguistic diversity, using a wide array of instructional strategies (NRC, 2010). Teachers need to select and simplify content appropriately and connect it to teaching strategies. Moreover, teachers need to ask questions of varying complexity, identify common misconceptions, provide feedback, and react with appropriate intervention strategies. According to Shulman (1987), a third knowledge domain necessary for successful teaching was general pedagogical knowledge which involves “broad principles and strategies for classroom management and organisation that transcend subject matter” (p. 8), as well as knowledge about learners and learning, assessment, and educational contexts and purposes. This knowledge dimension, supposed to be relevant for generic teacher tasks such as classroom management and participating in general school activities, was not part of the core study in TEDS-M, but several participating countries developed national options.

Teacher Education Outcomes: Teachers’ Professional Beliefs In TEDS-M, beliefs were defined as “understandings, premises or propositions about the world that are felt to be true” (Richardson, 1996, p. 103). Beliefs rely on

334

S. Blo¨meke

evaluative and affective components (Rodd, 1997) and were regarded a crucial aspect of teachers’ perceptions of classroom situations (Leder et al., 2002). Thus, they were also hypothesized to be an indicator of teaching methods teachers would use in future classrooms (Nespor, 1987). Studies had revealed that there was a relation between teacher beliefs and student achievement if beliefs were looked at alongside both the subject being taught and a professional task which needed to be mastered (Bromme, 2005). Beliefs are, however, not a well-defined construct (Pajares 1992). Clear distinctions between beliefs and other concepts such as attitudes or perceptions are rare. At the same time, the relation between beliefs and knowledge – in particular pedagogical content knowledge and general pedagogical knowledge – is more heuristic than can strictly be maintained (Furinghetti & Pehkonen, 2002). Several efforts have been made to categorize the belief systems of teachers (Op’t Eynde et al., 2002). TEDS-M distinguished between epistemological beliefs about the nature of mathematics, teaching and learning of mathematics, and teachers’ evaluations of the effectiveness of teacher education and their own preparedness. Results from the MT21 study had revealed country-specific differences in future teachers’ beliefs (Schmidt et al., 2011): German, Mexican, and US future teachers agreed more strongly with statements that described the dynamic nature of mathematics than Taiwanese, South Korean, and Bulgarian future teachers. Results of TALIS referring to teachers’ epistemological beliefs on teaching and learning of mathematics pointed to the same direction: Teachers in Western countries agreed more strongly with constructivist views on teaching than teachers in South Asian and South American countries (Vieluf & Klieme, 2011).

Research Questions

To sum up, the research questions of TEDS-M PRIMARY and TEDS-M SECONDARY were located at three educational levels: the national context of the participating countries (macro level), the institutional context where teacher education programs took place (meso level), and the individual future teachers with their background at the beginning and their knowledge and beliefs at the end of teacher education (microlevel). Additional research objectives were directed toward the relations between these levels. The core research questions can be described as follows (Tatto et al., 2008; Tatto, 2020):

• Which knowledge and beliefs did future primary and lower-secondary mathematics teachers in the participating countries have at the end of teacher education?
• Which OTL did teacher education provide to future primary and lower-secondary mathematics teachers?
• What were the main characteristics of teacher educators providing study programs in the participating countries?
• What were the main characteristics of teacher education institutions?
• What were the costs of teacher education in the participating countries?
• How did the social, school, and policy context vary across the participating countries?
• How were future teachers' backgrounds, their OTL, and the characteristics of teacher educators and teacher education institutions related to teachers' knowledge and beliefs at the end of teacher education?

Study Design of TEDS-M

Sampling

Seventeen countries took part in TEDS-M. The two target populations were defined as future teachers in their final year of teacher education who would receive a license to teach mathematics in primary or lower-secondary schools (Tatto et al., 2008). A teacher education program was identified as primary school level if the qualification included one of the grades 1 to 4 (TEDS-M PRIMARY; primary or basic education, cycle 1; UNESCO, 1997) and as lower-secondary level if the qualification included grade 8 (TEDS-M SECONDARY; basic education, cycle 2; UNESCO, 1997). Fifteen countries took part in both TEDS-M studies (see Table 1), while Spain limited its participation to future primary teachers and Oman to future lower-secondary teachers.

In a two-stage process, random samples were drawn from these two target populations in each participating country (Tatto, 2013). The samples were explicitly or implicitly stratified according to important features of teacher education in order to reflect accurately future teachers' characteristics at the end of their program. Such features were:

– The type of program ("consecutive pathway," where specific teacher education courses were taken after a bachelor degree in a subject, vs. "concurrent pathway," where subject-specific courses were taken together with teacher education courses)

Table 1 Participating countries in TEDS-M PRIMARY and TEDS-M LOWER-SECONDARY: Botswana (census), Chile (census), Georgia (census), Germany, Malaysia, Norway (partly overlapping samples), Oman (lower-secondary level only), Philippines, Poland (concurrent programs only), Russia, Singapore (census), Spain (primary level only), Switzerland (German-speaking regions only), Taiwan, Thailand (census), USA (public institutions only)

Note. Canada was excluded from TEDS-M reports because the country did not meet the response rate requirements


– The school level to be taught (the grade range included in the teaching license, e.g., grades 1 to 4 vs. grades 1 to 10 in TEDS-M PRIMARY, or grades 5 to 10 vs. grades 5 to 12 in TEDS-M SECONDARY)
– The amount of OTL in mathematics (with or without a minimum number of classes)
– The region where a teacher education institution was based (e.g., federal states in Germany)

As a first stage, institutions where either primary or lower-secondary teacher education programs (or both) took place were drawn with selection probabilities proportional to their size (Tatto, 2013). The number of participating institutions per country varied from 1 (Singapore) or 6–7 (Canada, Botswana, and Oman) to 80 (Philippines). As a second stage, within each institution, a comprehensive list of eligible future teachers (FTs) was compiled. Then, at least 30 FTs were randomly selected from each target population (or all if fewer than 60 FTs were found in a given population). Furthermore, a comprehensive list of teacher educators was compiled within each institution, from which again at least 30 educators were randomly selected (or all if fewer than 60). While the IEA Data Processing and Research Center was responsible for selecting the samples of teacher education institutions, countries were responsible for selecting the samples of FTs and teacher educators within these institutions.

The benchmark for every country's sample size was the precision of a simple random sample of 400 future primary and 400 future lower-secondary mathematics teachers in their final year of teacher education at the national level. However, the sample size was lower in most countries because the target populations were relatively small. Four sets of weights were created, reflecting the four samples:

Teacher education institutions Teacher educators Future primary mathematics teachers Future lower-secondary mathematics teachers

The weights allowed for the estimation of country-level parameters from the samples and took the selection probabilities at the various sampling stages and the responses rates obtained into account (Tatto, 2013). Furthermore, estimations of standard errors taking the complex sample design into account were supported through the balanced repeated replication technique. In 2008, more than 24,000 future mathematics teachers from more than 500 teacher education institutions in 17 countries (see Table 1) were tested on their MCK and MPCK (and in several countries also on their GPK) with standardized paper-and-pencil assessments and surveyed on their beliefs, OTL, and background. The number of future primary mathematics teachers included in TEDS-M ranged from 86 in Botswana to 2,266 in the Russian Federation. The number of future lower-secondary teachers ranged from 53 in Botswana to 2,141 in the Russian Federation. Apart from Singapore where teacher education took place at one

15

IEA’s Teacher Education and Development Study in Mathematics (TEDS-M)

337

institution, the number of teacher education institutions ranged at the primary level from 4 in Botswana to 78 in Poland and at the lower-secondary level from 3 in Botswana to 48 in the Philippines and Russian Federation. In most countries, TEDS-M covered the full target population (Tatto, 2013). Canada, Switzerland, Poland, and the USA limited their participation because of difficulties reaching the target population. In Canada, only four out of ten provinces took part in TEDS-M (Ontario, Quebec, Nova Scotia, and Newfoundland and Labrador). In Switzerland, institutions in regions where German was the first language of instruction took part, whereas French- and Italian-speaking institutions were not covered. In Poland, only institutions offering concurrent teacher education programs took part in TEDS-M; consecutive programs were thus not covered. In the USA, only public teacher education institutions took part in TEDS-M, whereas private institutions were not covered. For Norway, several data sets were available for both TEDS-M PRIMARY and TEDS-M SECONDARY that were likely to overlap. Using one subsample would lead to biased country estimates; combining subsamples would lead to imprecise standard errors. The results for Norway should therefore be regarded as a rough approximation only. TEDS-M followed the common IEA quality requirements such as controlling translation, monitoring test situations, and meeting participation rates (Tatto, 2013). If a country missed the participation benchmarks only slightly in one of the target populations, results were reported without annotation. In case the combined participation rate of the two sampling stages falls below 75% but was still above 60%, results were reported annotated (“combined participation rate less than 75%”). Due to a participation rate of teacher educators in the USA below that threshold, these results were not reported. Because Canada was unable to meet the minimum requirements of IEA in any of the target populations, the country was excluded from the study. The final number of participating TEDS-M countries was thus 16 (15 in the primary and 15 in the secondary study), and the final sample size was 13,871 future primary and 8,185 future lower-secondary mathematics teachers. It is important to note that 15 countries constitute a small sample size so that it is difficult to draw general conclusions from the data. Furthermore, countries decided themselves about a participation in TEDS-M which means that it is a self-selected sample.

Instruments: National (Macro Level) and Institutional Context (Meso Level) National survey. This part of TEDS-M built mainly on information collected by others such as OECD and Eurydice or official national policy documents. The participating countries were expected to deliver national reports where they summarized this information and described the social, educational, and policy context of teacher education based on a guiding list of questions (Tatto et al., 2008). These were directed, for example, toward the legislative framework for teacher education, the structure and organization of teacher education, quality assurance arrangements and

338

S. Blo¨meke

program requirements, funding, and recent or planned reforms of teacher education. In addition, TEDS-M carried out a standardized analysis of national teacher education curricula to receive information about the intended OTL on the national level. Finally, the costs of teacher education were estimated based on information provided by the participating countries (Carnoy et al., 2009). Institutional survey. Information about the institutions where teacher education took place was collected with an institutional questionnaire to be filled in by the leading administrator of teacher education programs (Tatto et al., 2008). It included questions about the selection of future teachers (FTs) into teacher education, the structure and content of the different programs offered at the institution, the type and amount of field experiences required, accountability mechanisms, staffing, program resources, and reflections on the program (for a complete list of questions and items, see Tatto, 2013). Teacher educator survey. A survey administered to teacher educators collected information about their formal qualifications and previous job experiences but also about their beliefs, teaching objectives, and the OTL they provided (for a complete list of scales and items with country-specific descriptives and international means, see Laschke & Blömeke, 2013; see also Brese, 2012c). Several of the beliefs and OTL scales were identical to those given to FTs. Curriculum analysis. A standardized curriculum analysis provided information about the intended curriculum on the level of teacher education institutions. A census of the syllabi for required and relevant elective courses in university mathematics, school mathematics, mathematics pedagogy, general pedagogy, and practical experiences in schools were collected from the participating institutions (Tatto & Hordern, 2017).

Instruments: Future Teacher Surveys (Microlevel) Opportunities to learn. TEDS-M intended to compare OTL during primary and lower-secondary teacher education across countries. Besides the curriculum analysis and teacher educators’ reports, FTs provided information about their individual OTL. The items included in the future teacher survey listed content topics, teaching methods, OTL related to typical job tasks of a teacher, and field experiences in mathematics, mathematics pedagogy, and general pedagogy. The items were the same for primary and lower-secondary teachers. FTs had either to indicate whether they had “studied” or “not studied” them or to rate these on four-point Likert scales from “never” to “often.” Overall, about 130 items were listed in the survey (for a complete list of items, see Brese, 2012c; Laschke & Blömeke, 2013). National expert reviews and pilot studies were carried out to support cultural validity of the items in all participating countries. Additional validity evidence was collected by correlating FTs’ data to the data from the standardized curriculum analyses (e.g., Blömeke et al., 2010). OTL in tertiary-level mathematics were assessed with 19 items covering discrete structures and logic (e.g., “linear algebra”), continuity and functions (e.g., “multivariate

15

IEA’s Teacher Education and Development Study in Mathematics (TEDS-M)

339

calculus”), geometry (e.g., “axiomatic geometry”), and probability and statistics (e.g., “probability”). OTL in school-level mathematics were assessed with seven items covering typical content areas such as numbers or calculus. OTL in mathematics education were surveyed with eight items covering foundations (e.g., “development of mathematics ability and thinking”) and instruction (e.g., “mathematics standards and curriculum”). OTL in general pedagogy were assessed with eight items in two subdimensions, namely, foundations (e.g., “philosophy of education”) and applications (e.g., “educational psychology”). Teaching methods experienced in teacher education classes were captured with 15 items in 2 generic and 1 mathematics-related subdomains, covering class participation (e.g., “ask questions during class time”), class reading (e.g., “Read about research on . . .” [several items with a variety of areas]), and solving problems (e.g., “write mathematical proofs”). In addition, three large sets of 26, 10, or 12 items, respectively, captured the OTL related to job tasks such as instruction, assessment, and developing one’s own teaching practice. Seven subdimensions were covered: instructional practice (e.g., “learn how to explore multiple solution strategies with pupils”), instructional planning (e.g., “create projects that motivate all pupils to participate”), teaching for diversity (e.g., “develop specific strategies and curriculum for teaching . . .” [several items targeting a variety of students]), assessment practice (e.g., “assess higher-level goals”), assessment use (e.g., “give useful and timely feedback to pupils about their learning”), teaching for improving practice (e.g., “connect learning across subject areas”), and teaching for reflecting on practice (e.g., “develop strategies to reflect on . . .” [several items targeting a variety of areas]). Finally, a broad range of single items and scales covered the field experiences of future primary and lower-secondary mathematics teachers in schools. Core scales assessed the quality of these OTL and the coherence of the different elements of a teacher education program. The quality scales were among others directed toward the feedback quality of the supervising teacher during the practicum (e.g., “The feedback I received from my supervising teacher helped me to improve my teaching methods.”). Program coherence was covered by the degree to which FTs had a chance to connect classroom learning to practice (e.g., “test out findings from educational research about difficulties pupils have in learning in your courses”), the degree to which they had experienced reinforcement of university goals for practice by the supervising teacher during their practicum (e.g., “My school-based valued the ideas and approaches I brought from my teacher education program.»), and an evaluation of program coherence in general (e.g., “Later in the program built on what was taught in earlier in the program.”). For a complete list of all items, see Laschke and Blömeke (2013) and Brese (2012c). Teacher knowledge: MCK and MPCK. One core outcome of teacher education, assessed in all participating countries, was the domain-specific knowledge of future primary and lower-secondary mathematics teachers in terms of their MCK and MPCK. Two 60-minute standardized paper-and-pencil assessments were developed that had to be completed during monitored test sessions (Tatto, 2013). The items were intended to depict the knowledge underlying teachers’ classroom performance

The item development had to follow the TEDS-M conceptual framework (Tatto et al., 2008), and all participating countries sent in item suggestions to avoid cultural bias. The item pool was reviewed by large groups of experts, first within the participating countries and then cross-nationally. All national TEDS-M coordinators approved the final version of the knowledge assessments as appropriately reflecting the respective national teacher education curricula. The 74 mathematics items included number (as that part of arithmetic most relevant for teachers; 25 items), algebra (23 items), and geometry (21 items), with each set of items having roughly equal weight, as well as a small number of items about data (as that part of probability and statistics most relevant for teachers; 5 items). Three cognitive dimensions were covered: knowing, i.e., recalling and remembering (33 items); applying, i.e., representing and implementing (29 items); and reasoning, i.e., analyzing and justifying (12 items). The 32 mathematics pedagogy items assessed two subdimensions, namely, pre-active curricular and planning knowledge (16 items), which is necessary before a mathematics teacher enters the classroom (e.g., establishing appropriate learning goals, knowing different assessment formats, or identifying different approaches for solving mathematical problems), and interactive knowledge about how to enact mathematics for teaching and learning (16 items; e.g., diagnosing students’ responses including misconceptions, explaining mathematical procedures, or providing feedback). In addition, the MPCK test covered the three cognitive dimensions mentioned above. The majority of MCK and MPCK items were complex multiple-choice items; some were partial-credit items. A matrix design with five (future primary teachers) or three (lower-secondary) test booklets of the type “balanced incomplete block design” was applied to capture the desired breadth and depth of teacher knowledge. About a quarter of the TEDS-M items – 24 MCK and 10 MPCK items from the primary and 23 MCK and 9 MPCK items from the lower-secondary study – have been released by the IEA (see Brese, 2012b; Laschke & Blömeke, 2013). Three quarters of the items were kept to make a link between TEDS-M and a potential follow-up study possible.

Teacher knowledge: GPK. General pedagogical knowledge was not part of the TEDS-M core, but Germany, Taiwan, and the USA together developed an instrument that was included in the TEDS-M assessments as a national option (König et al., 2011). It consisted of 77 test items. These included dichotomous and partial-credit items as well as open-response (about half of the test items) and multiple-choice items. Four dimensions of GPK were covered: knowledge about how to prepare, structure, and evaluate lessons (structure), how to motivate and support students as well as manage the classroom (motivation/classroom management), how to deal with heterogeneous learning groups in the classroom (adaptivity), and how to assess students (assessment). Similar to the MCK and MPCK assessments, three cognitive processes were in addition used to balance the item composition: FTs had to retrieve information from long-term memory (recalling), to understand or analyze a concept or term (understanding/analyzing), or they were asked to generate strategies on how they would solve a classroom situation (generating).

Items were equally distributed across the four content and the three cognitive subdimensions. Following the MCK and MPCK test design, five (TEDS-M PRIMARY) or three (TEDS-M SECONDARY) booklets in a balanced incomplete block design were used. Several expert reviews in the USA, Germany, and Taiwan, as well as two large pilot studies, were carried out to prevent cultural bias. For the open-response items, coding rubrics were developed (for more details, see König et al., 2011).

Future teachers’ mathematics-related beliefs. Future primary and lower-secondary teachers’ beliefs related to their tasks as mathematics teachers were assessed in five subdimensions: epistemological beliefs about the nature of mathematics (two subdimensions), about the teaching and learning of mathematics (two), and about mathematics ability. FTs’ beliefs about the nature of mathematics were assessed by adapting an instrument developed by Grigutsch et al. (1998). The items had a two-dimensional structure representing a static and a dynamic view of mathematics, and future primary and lower-secondary teachers had to express their agreement on a six-point Likert scale from strongly disagree to strongly agree. A dynamic view focused on mathematics as a process of inquiry. The scale’s six items emphasized the process- and application-related character of mathematics, for example, “in mathematics you can discover and try out new things by yourself” or “many aspects of mathematics are of practical use.” A static view focused on mathematics as a set of rules and procedures. This scale’s six items stressed the importance of definitions and formulae, for example, “mathematics is a collection of rules and procedures that prescribe how to solve a problem” or “logical rigor and precision are fundamental to mathematics.”

FTs’ beliefs about the teaching and learning of mathematics were assessed with two scales from instructional research (Peterson et al., 1989). The first scale represented a constructivist view. Strong agreement meant that teachers regarded mathematics learning as an active process in which students developed their own inquiries and approaches to problem-solving. Two examples of these items were “In addition to getting the right answer, it is important to understand why the answer is correct” and “Teachers should allow pupils to develop their own ways of solving mathematical problems.” In contrast, the second scale depicted mathematics learning as directed by the teacher. Two examples of these items were “The best way to do well in mathematics is to memorize all the formulae” and “Pupils need to be taught exact procedures for solving mathematical problems.” Finally, one scale assessed whether future teachers regarded mathematics achievement as a fixed ability. The eight items were developed based on Stipek et al. (2001) in the context of the MT21 project (Schmidt et al., 2011) and had to be rated on a six-point Likert scale from strongly disagree to strongly agree. An example item was “Mathematics is a subject in which natural ability matters a lot more than effort.”

Future teachers’ beliefs evaluating their teacher education. Several scales assessed how FTs perceived their education. A first indicator was the preparedness for teaching mathematics, which was covered with 13 items that had to be rated on a four-point Likert scale from “not at all” through “to a major extent.”

An item example was “Establish appropriate learning goals in mathematics for pupils.” In addition, FTs had to evaluate the quality of instruction with six items to be rated on a six-point Likert scale from strongly disagree to strongly agree. An example item was “Model good teaching practices in their teaching.” Furthermore, FTs had to provide information about circumstances that hindered their studies, such as family responsibilities, having to borrow money, or holding a job in parallel to teacher education, and to provide an estimate of the overall effectiveness of teacher education.

Future teachers’ background. FTs’ age was covered with an open-ended question. Gender was reported dichotomously with two values (female, male). The language spoken at home in contrast to the official language of instruction in teacher education was captured with a four-point Likert scale (“never” through “always”). Parent education was used as one of three indicators of FTs’ socioeconomic status (SES). It was measured separately for fathers and mothers on scales covering the seven most important ISCED levels (“primary” through “beyond ISCED 5A”). Another SES indicator was the number of books at FTs’ parents’ home, to be rated on a scale with five points ranging from 0–10 to more than 200 books. The third SES indicator was which cultural items FTs’ parents had at home, such as a computer, a dictionary, or cars. Self-reported high-school achievement was used as a proxy for FTs’ generic prior knowledge at the start of teacher education. It was measured across school subjects with a five-point Likert scale on which FTs had to compare themselves with their age cohort (“generally below average” through “always at the top”). Domain-specific prior knowledge was surveyed through the number of mathematics classes taken during K-12 schooling (K-12 = primary, lower-secondary, and upper-secondary schooling; five-point Likert scale from “below year 10” through “year 12 (advanced level)”). In addition, there was a country-specific question about the most advanced mathematics class taken at upper-secondary school.

The motives to become a teacher were captured in three subdimensions: intrinsic-pedagogical, intrinsic-academic, and extrinsic motivation. Four, two, or three statements, respectively, had to be rated on four-point Likert scales (“not a reason” through “a major reason”). An indicator of intrinsic-pedagogical motives was “I like working with young people.” An indicator of intrinsic-academic motives was “I love mathematics,” and an indicator of extrinsic motives was “I seek the long-term security associated with being a teacher.” FTs were also asked whether they had had another career before they started teacher education and whether they saw their future in teaching.

Data Analysis

With respect to FTs’ knowledge, scaled scores were created separately for MCK and MPCK in one-dimensional models using item response theory. The achievement scores were transformed to a scale with an international mean of 500 test points and a standard deviation of 100 test points.

For dichotomous items, the standard Rasch model was used and, for polychotomous items, the partial-credit model (Tatto et al., 2012).

With respect to FTs’ beliefs and OTL, initial exploratory factor analyses were carried out based on data from an extensive pilot study, followed by confirmatory factor analyses based on data from the main TEDS-M study. The structure of the scales was consistent across studies and target populations, which provided validity evidence regarding the construct definitions. To assess the degree to which factor structures were measurement invariant across countries, multigroup confirmatory factor analysis (MCFA) was used. The results provided evidence of the fit of the given factor structure in most countries (Tatto, 2013). The raw data were scaled using a partial-credit IRT model (Tatto et al., 2012). The belief and OTL scores were transformed to a scale with a mean of 10 representing a neutral view.

The methods used to analyze the TEDS-M data depended on the research questions. Given that the data had a nested structure with FTs nested in study programs nested in teacher education institutions nested in countries, multilevel modeling was often used in studies examining questions where several levels were involved, in order to obtain correct standard errors (Hox, 2002). With such an approach, it was also possible to use covariates at any level of the hierarchy. Multilevel modeling with TEDS-M data was often carried out in a variable-oriented way (e.g., Blömeke et al., 2012). Latent class or profile analysis, in contrast, was rarely applied (e.g., Blömeke & Kaiser, 2012). All research studies using TEDS-M data were expected to incorporate weights to reflect uneven selection probabilities and nonresponse rates so that robust population estimates could be obtained (Meinck & Rodriguez, 2013). The unit of analysis varied considerably across studies depending on the research question, for example, countries, teacher education institutions, or the so-called teacher preparation units (TPU), which describe teacher education programs within institutions.

For more information about the TEDS-M database and how to use it, see Brese (2012a). The user guide contains documentation on all derived variables made available for secondary analyses.
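To make the scaling logic concrete, the following minimal sketch illustrates the two steps described above: estimating an ability value under the Rasch model and re-expressing it on a reporting scale with a mean of 500 and a standard deviation of 100. It is a toy illustration only; the item difficulties, responses, and calibration constants are hypothetical, and the operational TEDS-M scaling of course relied on the full item pool, the booklet design, and sampling weights.

```python
# Toy illustration of the scaling logic: Rasch-model response probabilities,
# a crude grid-based maximum-likelihood ability estimate, and the linear
# transformation onto a 500/100 reporting metric. All values are hypothetical.
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def theta_mle(responses, difficulties, grid=np.linspace(-4.0, 4.0, 801)):
    """Grid-search maximum-likelihood estimate of ability (illustrative only)."""
    p = rasch_prob(grid[:, None], difficulties[None, :])      # |grid| x |items|
    loglik = (responses * np.log(p) + (1 - responses) * np.log(1.0 - p)).sum(axis=1)
    return grid[np.argmax(loglik)]

# Hypothetical item difficulties (logits) and one future teacher's scored responses
difficulties = np.array([-1.2, -0.4, 0.0, 0.6, 1.1])
responses = np.array([1, 1, 1, 0, 0])

theta = theta_mle(responses, difficulties)

# Re-express the logit-scale estimate on the reporting scale: 500 + 100 * z,
# where the mean/SD come from the (weighted) international calibration sample.
calib_mean, calib_sd = 0.0, 1.0   # assumed calibration constants
score = 500 + 100 * (theta - calib_mean) / calib_sd
print(f"theta = {theta:.2f} logits -> reported score = {score:.0f}")
```

The belief and OTL scales followed an analogous route, using a partial-credit likelihood and a reporting mean of 10 representing a neutral view.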

Results

A systematic literature review revealed a large number of studies using data from TEDS-M. These covered macro-, meso-, and microlevel factors as well as their relations. Studies on cognitive outcomes of teacher education dominated, but several papers also examined their relation to OTL. An overview of the TEDS-M conceptual framework, instruments, and results was provided in Tatto et al. (2012) and Blömeke et al. (2014). Table 2 provides a summary of 19 core articles beyond these publications, identified in the literature review. Fifteen of them were selected because they appeared in peer-reviewed international journals listed in the Social Science Citation Index: 12 of the articles appeared in the Journal of Teacher Education, Teaching and Teacher Education, International Journal of Science and Mathematics

Table 2 Nineteen core TEDS-M publications identified in a systematic literature review

Reference: Ingvarson and Rowley (2017)
Research type: Relation of macro-level factors to outcomes
Research objectives: • Description of quality assurance related to recruitment and selection of FTs (control over number of study places, status of teaching profession, prerequisites for entry to teacher education), accreditation of teacher education programs, certification, and entry to the profession • Relation between quality assurance and teacher education outcomes (MCK, MPCK) by study programs • Relation between quality assurance and students’ mathematics achievement (TIMSS, PISA) by countries
Sample: TEDS-M PRIMARY; TEDS-M SECONDARY; 16 countries; ~22,000 FTs; ~500 teacher education institutions; ~750 TPUs
Method: Scoring of quality assurance as “high,” “medium,” “low”; composite measure “recruitment and selection”; overall composite measure averaging recruitment and selection, accreditation of programs, and entry to profession
Results: Quality assurance arrangements related to the recruitment, selection, preparation, and certification of FTs varied considerably across the 17 TEDS-M countries; significant country-level relations existed between the strength of quality assurance arrangements and both FTs’ MCK and MPCK and students’ TIMSS and PISA achievement in mathematics
Conclusions: An integrated approach to ensuring the quality of future teachers, with shared responsibilities for recruitment and selection, accreditation of programs, and entry to the profession, would likely lead to a system that supports high-quality instruction and student learning

Reference: Kaiser and Blömeke (2013)
Research type: Relation of macro-level factors to outcomes
Research objectives: Comparing achievement and beliefs of FTs from Eastern and Western countries; analyzing cultural differences between Eastern and Western countries with respect to FTs’ achievement and beliefs
Sample: TEDS-M PRIMARY; TEDS-M SECONDARY; 16 countries; 13,871 + 8,185 FTs
Method: Summary of empirical TEDS-M studies
Results: • Huge MCK and MPCK differences between countries: Taiwan and Singapore highest achievement; Russia and Thailand surprisingly high in relation to the Human Development Index • Taiwan, Thailand, Russia, Poland, Germany, and Switzerland stronger in MCK than MPCK; Norway, the USA, Spain, Chile, Malaysia, and the Philippines stronger in MPCK than MCK • Dominance of constructivist beliefs in Western European countries and Taiwan; of transmission-oriented beliefs in Eastern European and Southeast Asian countries
Conclusions: Effective teaching and learning environments may have different shapes in different countries; it is not appropriate to transfer isolated measures from one educational system to another; international comparative studies have the potential to reveal an unbalanced view of one’s own culture; the Eastern and Western debate enables both to learn from each other

Reference: Felbrich et al. (2012)
Research type: Relation of macro-level factors to outcomes
Research objectives: Level and pattern of epistemological beliefs about the nature of mathematics across countries (mathematics as a process of inquiry, mathematics as a set of rules and procedures); examining to what extent beliefs are influenced by the extent to which a country’s culture can be characterized by an individualistic versus collectivistic orientation according to Hofstede’s terminology
Sample: TEDS-M PRIMARY; 15 countries; 13,871 FTs
Method: CFA supporting the two-dimensional structure of the six items each, rated on six-point Likert scales; Rasch scaling, scores transformed to a mean value of 10; two-level modeling (FTs, countries)
Results: Beliefs varied within and between countries: FTs from the Philippines, Thailand, Malaysia, and Botswana agreed with static and dynamic beliefs; agreement of FTs from Chile, Germany, Switzerland, Norway, and Taiwan with static beliefs, and of FTs from Germany, Switzerland, Norway, Georgia, Russia, and Poland with dynamic beliefs, was below the international mean • FTs in individualistic countries preferred dynamic over static beliefs, FTs in collectivistic countries static over dynamic beliefs
Conclusions: On the level of countries, the beliefs pattern seemed to correspond with the individualistic-collectivistic orientation of a country

Reference: Carnoy et al. (2009)
Research type: Relation of macro-level factors to outcomes
Research objectives: Comparing salaries of primary and secondary school teachers with mathematics-oriented professions (engineering, scientific fields, accounting) • Examining the relationship between teacher salaries and student achievement in TIMSS and PISA • Examining the relation of student achievement to income inequality
Sample: 20 developing and developed countries
Method: Estimation of age-earnings curves for men and women in mathematics-intensive occupations and teaching by level of education (first university degree, master’s degree) for 2 years (1990s, 2000, or beyond) • Regression of student achievement on income inequality (most recent Gini coefficient), controlling for GDP per capita in purchasing power parity dollars for 2000
Results: The ratio of median earnings of teachers and mathematics-oriented professions revealed three sets of countries with relatively high (>90% of a scientist’s salary), medium (80–90%), and low salaries (47
Conclusions: Higher teacher pay may draw individuals with higher mathematics skills into teaching, thereby resulting in

Reference: Tatto and Senk (2011)
Research type: Description of macro-, meso-, and microlevel factors
Research objectives: Describing the structure of teacher education • Describing OTL in tertiary-level and school-level mathematics • Describing FTs’ MCK
Sample: TEDS-M PRIMARY; TEDS-M SECONDARY; 17 countries; 15,163 + 9,389 FTs; 500 teacher education institutions
Method: Rasch scaling of MCK; qualitative descriptions of anchor points

Table 1 (excerpt) Key characteristics of the video studies: subject area(s), countries, grade level, lessons/classrooms per country, topic, and analytical framework

TIMSS 1999: lessons/classrooms per country 1/50>80
TEDS-Instruct 2014–2017 (Yang et al., 2019): subject Maths; 2 countries; grade 8; lessons/classrooms per country 2/38
LISA Nordic 2019–2022 (Klette et al., 2017): subjects Maths, LA; 5 countries; grade 8 (7)*; lessons/classrooms per country 3-4/10; topic: none; framework: PLATO framework
OECD GTI 2018–2021 (Opfer et al., 2020): subject Maths; 8 countries; grade 8; lessons/classrooms per country 2/50>80; topic: quadratic equations; framework: developed their own framework

*Depending on whether students start school at the age of 6 or 7 years old (grades 7–9/grades 8–10)

countries for the purpose of cross-contextual and international comparison. In this overview, I concentrate on international large-scale video studies and will refer to national large-scale video studies only when relevant. As a lens on instructional quality, video observation can be used for different purposes, including research on teaching and learning, teacher evaluation, and teacher professional development. In this overview, I concentrate on research on teaching and learning and only briefly summarize the use of video observations for the purpose of teacher professional learning. For those interested in videos for the purpose of teacher evaluation, confer Rowan and Raudenbush (2016), Cohen and Goldhaber (2016), and Mikeska et al. (2018). In this chapter I argue that developments in observation design, and especially developments in video design and video technology, have contributed to a “revolution” in studies of classroom teaching and learning. Developments in technology (e.g., small, miniaturized, and nonintrusive cameras; available software tools targeting video analyses, such as Interact, Observer, Study Code), combined with the use of digital devices and social media in everyday situations (including video technology as a means of communication), have paved the way for a new generation of classroom studies and comparative studies using videos to measure and understand teaching quality across contexts.

The first and second Trends in International Mathematics and Science Study (TIMSS) video studies (Stigler & Hiebert, 1997, 1999) represented pioneering work in this sense, using video clips from three and seven countries, respectively, to document and study teachers’ instructional repertoires around the world. Today, classroom video capturing is a part of the regular methodological toolkit when trying to understand features of teaching and learning across contexts. However, video studies are still rare at scale and remain a newcomer in the landscape of international large-scale assessment research. In this chapter, I summarize what we know so far about using video observation for large-scale studies on teaching quality. I first discuss issues and challenges to consider when aiming to use videos as an approach to understand teaching quality; specifically, these issues include the purpose of the study, the teaching practices captured, the theoretical frameworks of analyses, methodological dilemmas (e.g., subject specificity, the grain size of analyses, scoring specifications, and requirements), and the number of lessons necessary for a valid assessment. I then discuss briefly how developments in technologies and social media have paved the way for increased use of video capturing as a means of retrieving and analyzing teaching quality cross-nationally. In the last section, Moving Forward, I discuss how video documentation can be used for longitudinal studies, teacher evaluation, and teachers’ professional development.

Issues and Challenges to Consider

In this section, I describe nine issues relevant to consider when using video as a means of assessing teaching and learning in large-scale studies. The different issues span from methodological challenges, such as how to sample classrooms and select a suitable analytical framework for analyzing the videotaped lessons, to theoretical aspects, such as the views of teaching and learning underpinning the different analytical frameworks as well as their empirical evidence. I do not aim to cover all possible issues, and subsequent and emergent reviews may productively expand or revise this overview. Given existing international experience of developing and using large data sets of videos as lenses to analyze, understand, and evaluate classroom teaching (Bell, 2021a; Clarke et al., 2006b; Janik & Seidel, 2009; Kane et al., 2012; Klette et al., 2017; Opfer et al., 2020; Stigler & Hiebert, 1999), I focus on issues that the literature summarizes as essential when assessing classroom teaching and learning and that vary across designs and studies. The nine issues build on existing research summarizing key challenges when using video data (Bell et al., 2019; Derry et al., 2010; Goldman et al., 2007; Klette & Blikstad-Balas, 2018; Praetorius et al., 2019) and paraphrase the four broader categories suggested by Schoenfeld et al. (2018) as critical when evaluating different observation designs and analytical frameworks drawing on video data. Following Schoenfeld et al. (2018), four categories should be considered when evaluating different observation frameworks: (a) how the framework relates to theories of domain proficiency, (b) how the framework is decomposed into elements that count and are easy to understand and use (parsimonious), (c) whether the framework is timely and fruitful for the intended purposes, and (d) how the framework is quantitatively robust in terms of measurement and empirical evidence.

I will summarize the nine issues under the following broad headings: (a) purpose of the study, (b) dimensions of teaching practices captured, (c) theoretical underpinning (views of teaching and learning), (d) subject specificity, (e) grain size, (f) scoring specifications, (g) focus on students or teachers (or both), (h) the connection between the teaching facets measured and student outcomes (empirical evidence), and (i) ethics. Below, I elaborate on each issue and provide relevant methodological and empirical illustrations. Of course, challenges of measurement invariance and rater error come into play when conducting large-scale comparative assessment video studies. Due to page restrictions, readers are referred to ▶ Chap. 31, “Cross-Cultural Comparability of Latent Constructs in ILSAs,” by He, Buchholz, and Fischer, in this volume, for a more elaborate treatment of these issues.

Purpose of the Study

Observation has a longstanding position in education when aiming to capture facets of teaching and learning, as it allows for detailed and contextual data regarding both teachers and their students. A major challenge when conducting large-scale assessment video studies is, however, how to produce generic data that at the same time provide sufficient contextual information relevant for the purpose (Luoto et al., 2022). As indicated in the introduction, video capturing from classrooms can be used for multiple purposes. Teachers might observe each other, principals might observe their teachers, teacher educators and teacher mentors might observe student teachers, and researchers might visit classrooms to identify possible relations between teaching and learning. The predominant reason to use large-scale assessment studies of teaching and learning drawing on video data is for research purposes. Only a few large-scale studies have drawn on video capturing for the purpose of teacher evaluation (Martinez et al., 2016), to improve teaching and/or pursue reform efforts (Taut et al., 2009; Taut & Rakoczy, 2016), or for professional development (Borko et al., 2010; Hill et al., 2013; Gaudin & Chaliès, 2015). However, few of them are really at scale in terms of involving multiple classrooms, lessons, and teachers across several countries and economies. More importantly, even fewer video studies focused on professional development qualify as international large-scale assessment studies. Conversely, we see an increase in using videos at scale for the purpose of research. Since the turn of the millennium, a small number of video studies have been conducted to understand, analyze, and improve aspects of teaching and learning with a comparative ambition – be it within a specific subject domain or with a focus on generic aspects of teaching. Examples of such studies are the Learner’s Perspective Study (LPS) involving 13 economies/countries (Clarke et al., 2006b), the Pythagoras study from Germany and Switzerland (Lipowsky et al., 2009), and the Quality in Physics study (QuIP), drawing on video data from Germany, Switzerland, and Finland (Fischer et al., 2014). (When summarizing the field here, I concentrate on studies designed as large-scale comparative video studies that include two or more countries. Studies applying a pre-defined observation instrument such as the ISTOF observation instrument (Teddlie et al., 2006) in live classrooms (see Mujies et al., 2018) will not be included in this summary.)

In addition, the Teacher Education and Development Follow-Up study (TEDS-Instruct) (Yang et al., 2019) used video data to compare mathematics teaching in China and Germany, the Linking Instruction and Student Achievement (LISA) Nordic study drew on video capturing from the Nordic countries (Klette et al., 2017), and the recent OECD Global Teaching InSights (GTI) video study relied on video observation from eight countries with approximately 80 classrooms from each country (Opfer et al., 2020; OECD, 2020). All of these studies might be termed international video studies as they draw on data from more than one country. Case studies of instruction in two or more countries using videos will not be included in this overview as they do not meet the criterion of being large-scale and have less rigorous and standardized criteria when it comes to features of analyses. Large-scale studies differ from one another in scope and focus (e.g., subject specificity), countries involved, grade level visited, amount of data gathered (e.g., number of lessons and classrooms videotaped), and analytical frameworks used for analyses. Several of these studies draw on data from mathematics and/or science classrooms, as video studies in these subject areas have been seen as especially relevant and feasible for studying teaching quality around the world; see, for example, the LPS study, the Pythagoras study, and the OECD GTI video study. The TIMSS studies (focusing on mathematics in 1995 and mathematics and science in 1999) might have influenced the interest in and enhanced focus on mathematics classrooms as they provide possibilities for comparing and contrasting classrooms (Clarke et al., 2006a; Leung, 1995; Seidel & Prenzel, 2006). Also, the assumption that mathematics is the least context-specific school subject (Schweisfurth, 2019) might have spurred an interest in analyzing mathematics comparatively and at scale. Drawing on the initial TIMSS analyses and findings might, in addition, have paved the way for longitudinal analyses of mathematics classrooms (see, for example, Givvin et al., 2005; Leung, 1995; Stigler & Hiebert, 1999). While the Global Teaching InSights video study developed a common observation manual (e.g., “observation system”; see Bell et al., 2019; Hill et al., 2012b) when analyzing the data, the LPS study was designed for multiple and contextual analyses. A key rationale behind the LPS design was to enable varied and multilayered analyses while being sensitive to contextual factors and differences. One of the key contributions from the LPS study is what Clarke and colleagues (Clarke et al., 2012) term the “Validity-Comparability Compromise” (page 2/172). Drawing on this concept, they argue that the pursuit of commensurability through the imposition of a general classificatory framework can misrepresent the way in which valued performances and school knowledge are conceived by each community and sacrifice validity in the interest of comparability (Clarke et al., 2012). This argument resembles the issue of fairness, proposed by Schoenfeld et al. (2018) to address how legitimate and fair different frameworks are in terms of national and cultural variations, such as national policies for teaching.
Table 1 summarizes key characteristics of the different studies, such as purpose, subject area(s) covered, countries involved, grade level, number of lessons videotaped per teacher, and analytical framework used.

Although designed for a specific purpose, video data might be seen as “raw-data” (Fischer & Neuman, 2012, p. 120) that can be used widely, such as for professional development and teacher training, as with data from the TIMSS video study (Kerstin et al., 2010) and the LISA study (Klette et al., 2017). The scope of and limitations on the use of the video data are, however, determined by issues of privacy, anonymization, and consent from those participating in the study. For example, the General Data Protection Regulation (GDPR), introduced to regulate issues of data privacy in Europe in May 2018 (Regulation [EU] 2016/679 General Data Protection Regulation; Griffin & Leibetseder, 2019), puts severe limitations on how the data can be used if such uses were not clarified and approved by participants in the initial consent.

Dimensions of Teaching Practices Captured

When setting out to analyze and compare dimensions of teaching and instruction across contexts, it is important to agree on critical aspects, such as the subjects and/or topics in focus, the units of analysis, the grain size of comparison, and the facets of teaching practices to be captured (often termed domains and dimensions/subdimensions in the different observation manuals and frameworks). Depending on the purpose and focus of the study, researchers either use existing observation manuals and rubrics or develop their own. Observation rubrics identify a set of facets of teaching in focus and link them to a set of criteria and/or scoring procedures. As such, observation manuals are not only a “sheet of paper with rubrics and checklists detailing specific scales” (Liu et al., 2019, p. 64), but rather an integrated system of observation procedures, scoring specifications, and (sometimes) required training of raters (Bell et al., 2019; Hill et al., 2012b). All such scoring systems and checklists, hereafter termed “observation systems,” prioritize certain features of instruction and exclude others, embodying a certain community’s view of instructional quality (Bell et al., 2019). While there is increasing interest in using standardized observation systems in classroom research in general, several researchers develop their own instruments and/or use non-standardized, potentially informal instruments (Bostic et al., 2019; Stuhlman et al., 2010). Those using the latter approach have argued that the instrument should be sensitive to contextual factors as well as to the specific research questions and ambitions of the respective study. This type of “bottom-up” approach to conceptualizing teaching quality builds the definition of teaching quality from the data (Praetorius et al., 2019). The aforementioned LPS study was designed with such a bottom-up approach (Clarke et al., 2006a), providing the researchers with a design that paid attention to contextual specificities; thus, the core comparative design element was linked to data-gathering procedures rather than to procedures of analyses. Informal and non-standardized measurements of instruction may be useful for contextual purposes and for identifying features in classrooms that would otherwise be overlooked in a standardized observation system. Reflecting on findings from the TIMSS study (Stigler & Hiebert, 1999), Stigler and Miller (2018) discuss how teaching is a deeply cultural activity, developed and routinized over time with specific cultural traditions and patterns.

As a result, they argue that it is easy to misinterpret and colonialize patterns of instruction that may appear strange to those with a different tradition of schooling. However, informal and context-sensitive instruments make it difficult to systematically capture and analyze patterns of teaching and instruction across multiple classrooms and to aggregate consistent knowledge across studies; as such, these studies are in danger of representing researchers’ or research communities’ particular understanding of quality teaching rather than an empirically validated understanding (Praetorius et al., 2019). Having said that, I need to underscore that any instrument, whether predefined (top-down) or context-sensitive (bottom-up), will capture only the parts of teaching regarded as important according to the specific instrument. Further, there is no “best” instrument – instruments, like methods, have affordances and constraints. Researchers must understand these complexities and choose or develop instruments that suit their purposes and intended claims. Standardized observation systems, which are predefined and validated observation instruments, have, I will argue, significant methodological benefits in general and in video studies in particular. They facilitate a clear categorization and standardization of the massive amount of data video studies tend to generate. Standardized observation systems are top-down approaches to studying teaching quality, drawing on already existing conceptualizations of teaching. They come with theoretical underpinnings and predefined rubrics serving as lenses for understanding teaching and instruction in classrooms (Klette & Blikstad-Balas, 2018; Praetorius & Charalambous, 2018). Worth mentioning here, however, is how the recent GTI study used a “combined approach” (Bell, 2021a; Tremblay & Pons, 2019) when developing its observation tool. The authors drew on existing observation systems and instruments alongside bottom-up approaches that were sensitive to teaching practices regarded as valued in the eight participating countries, without losing comparative power. Broadly scoped, standardized observation instruments are useful because they can provide similarities in the meaning of scores across a range of contexts, which again might support the integration and accumulation of knowledge in comparative studies of teaching quality (Grossman & McDonald, 2008; Klette, 2022; Pianta & Hamre, 2009). However, Bell, Qi, Croft, Leusner, Gitomer, McCaffrey, and Pianta (2014), Rowan and Raudenbush (2016), and White and Ronfeldt (2020) note severe rating errors associated with scoring when using an observation system. (For more on measurement issues and rating error, see also ▶ Chap. 31, “Cross-Cultural Comparability of Latent Constructs in ILSAs” in this volume.) There is a fundamental tradeoff in this goal of having comparable scores across settings because obtaining consistent data across a range of contexts often means losing context sensitivity (Knoblauch & Schnettler, 2012; Snell, 2011), which may conceal important local aspects of instruction. For example, Bartlett and Vavrus (2017) suggest a combined approach (i.e., context-sensitive approaches together with standardized approaches) for the future of comparative classroom research. The OECD GTI study (Bell, 2021a; Opfer et al., 2020) recognized a combined approach as the preferred solution when developing a framework (and supporting instruments) for analyzing mathematics instruction across eight countries/economies in their GTI video study (OECD, 2020).

Stigler and Miller (2018) suggest a somewhat similar solution when they argue for combining three rather broad aspects of teaching quality (i.e., productive struggles, explicit connections, and deliberate practices) to be investigated across contexts and subject areas, in combination with detailed content- and curriculum-specific frameworks, for the future of comparative classroom studies.

Dimensions of Teaching Practices Captured: Looking Across Frameworks

Different frameworks include different facets (e.g., domains and dimensions) of teaching considered indicators of teaching quality. The assumption is that a more robust score on these facets signals better teaching and learning. However, when examining how the different observation systems converge and differ, it might be difficult for researchers (and stakeholders and practitioners) to decide which framework would best suit their purposes (Berlin & Cohen, 2018). Although often developed for a specific project or purpose, there are strong similarities across the different frameworks (Bell et al., 2019; Gill et al., 2016; Klette, 2022; Schlesinger & Jentsch, 2016). Analyses suggest, for example, strong commonalities across different observation systems when it comes to the teaching practices captured (Bell et al., 2019; Klette & Blikstad-Balas, 2018; Klieme et al., 2009; Praetorius & Charalambous, 2018), and most frameworks include practices such as “clear goals,” “cognitive challenges,” “supportive climate,” and “classroom management” as key facets when trying to capture features of teaching quality. Researchers seem to have reached a consensus around some key facets and domains that are essential when measuring teaching quality (Klette, 2015; Kunter et al., 2007; Nilsen & Gustafsson, 2016; Praetorius & Charalambous, 2018). These domains include instructional clarity (clear goals, explicit instruction), cognitive activation (cognitive challenge, quality of task, content coverage), discourse features (teacher–student interaction, student participation in content-related talk), and supportive climate (managing classrooms, creating an environment of respect and rapport). Most frameworks include three, or all four, of these domains (Bell et al., 2019; Schlesinger & Jentsch, 2016). However, although attending to the same key teaching practices and domains (e.g., instructional clarity, classroom discourse, cognitive challenge, and classroom management), the frameworks vary in terms of the grouping of different dimensions/sub-dimensions within and across domains, the level of detail (and grain size) when operationalizing domains into sub-dimensions, and the conceptual language and terminology used. Table 2 summarizes key domains captured (e.g., domains and dimensions), theoretical underpinning, and subject specificity across some frequently used observation systems in international large-scale video studies, including both generic manuals (e.g., the Classroom Assessment Scoring System (CLASS) (Pianta et al., 2008), the Framework for Teaching (FFT) (Danielson Group, 2013), the Three Basic Dimensions framework (TBD) (Klieme et al., 2009; Praetorius et al., 2018), and the International Comparative Analyses of Learning (ICALT) system (Van de Grift, 2007)) and subject-specific manuals (e.g., the Protocol for Language Arts Teaching Observations (PLATO) (Grossman et al., 2013), the Mathematics Quality of Instruction (MQI) (Hill et al., 2008), and Teaching for Robust Understanding in Mathematics (TRU Math) (Schoenfeld, 2014)).

Table 2 Summarized overview of some illustrative observation manuals

CLASS (Pianta et al., 2008); 1-7 point scale
Domains: 1. Emotional support; 2. Classroom organization; 3. Instructional support
Examples of dimensions/subcategories (13 subcategories): Emotional support: 1a. Positive climate, 1b. Negative climate, 1c. Teacher sensitivity. Classroom organization: 2a. Regard for student perspectives, 2b. Behavior management, 2c. Instructional learning formats. Instructional support: 3a. Concept development, 3b. Quality of feedback, 3c. Language modeling
Subject specificity: Generic
Theoretical underpinning: CLASS is grounded in developmental psychology and theories of student–teacher interaction

FFT (Danielson Group, 2013); 1-4 point scale; dimensions are called components
Domains: 1. Planning and preparation; 2. Classroom environment; 3. Instruction; 4. Professional responsibilities
Examples of dimensions/subcategories (22 components, 76 smaller elements; for example, Domains 2 and 3 have five components each): Domain 2, Classroom environment: 2a. An environment of respect and rapport, 2b. Establishing a culture for learning, 2c. Managing classroom procedures, 2d. Managing student behavior, 2e. Organizing physical space. Domain 3, Instruction: 3a. Communicating with students, 3b. Questioning and discussion techniques, 3c. Engaging students in learning, 3d. Using assessment in instruction, 3e. Demonstrating flexibility & responsiveness
Subject specificity: Generic
Theoretical underpinning: FFT is grounded in a constructivist view of learning, with emphasis on students’ intellectual engagement

TBD (Klieme et al., 2009; Praetorius et al., 2018); 1-4 point scale
Domains: 1. Classroom management; 2. Student support; 3. Cognitive activation
Examples of dimensions/subcategories (21 dimensions/subcategories): 1a. (Lack of) disruptions and discipline problems, 1b. (Effective) time use/time on task, 1c. Monitoring/withitness, 1d. Clear rules and routines; 2a. Differentiation and adaptive support, 2b. Pace of instruction, 2c. Constructive approach to errors, 2d. Factual, constructive feedback, 2e. Interestingness and relevance, 2f. Performance pressure and competition, 2g. Individual choice options, 2h. Social relatedness teacher → student, 2i. Social relatedness student → teacher, 2j. Social relatedness experience student → student; 3a. Challenging tasks and questions, 3b. Exploring/activating prior knowledge, 3c. Elicit student thinking, 3d. Transmissive understanding of learning, 3e. Discursive and co-constructive learning, 3f. Genetic-Socratic teaching, 3g. Supporting metacognition
Subject specificity: Generic
Theoretical underpinning: TBD is grounded in cognitive and constructivist views of learning

ICALT (Van de Grift, 2007); 1-4 point scale
Domains: 1. Classroom management; 2. Clear and structured instructions; 3. Activating teaching methods; 4. Adjusting instruction and learner processing; 5. Teaching-learning strategies; 6. Safe and stimulating learning climate; 7. Learner engagement (3 items)
Examples of dimensions/subcategories (32 dimensions/subcategories; only selective examples listed): 1a. Lesson proceeds in an orderly manner, 1b. Monitors that learners carry out activities, 1c. Effective classroom management; 2a. Explains the subject material clearly, 2b. Gives feedback to students, 2c. Checks for subject understanding; 3a. Activities that require an active approach, 3c. Stimulates learners to think about solutions, 3g. Specifies lesson aims (start of the lesson); 4a. Evaluates the lesson/lesson aims, 4c. Adjusts instruction to learner differences, 4d. Adjusts the subject matter to learner differences; 5a. Teaches how to simplify complex problems, 5b. Stimulates the use of control activities, 5e. Encourages learners to think critically, 5f. Asks learners to reflect on approach strategies; 6a. Respect for students (behavior and language), 6c. Promotes student self-confidence, 6d. Fosters mutual respect
Subject specificity: Generic
Theoretical underpinning: ICALT is aligned with cognitive and behavioral views of learning

PLATO (Grossman et al., 2013); 1-4 point scale; dimensions are called elements
Domains: 1. Instructional scaffolding; 2. Disciplinary demand; 3. Representation of content; 4. Classroom environment
Examples of dimensions/subcategories (12 dimensions/subcategories): 1a. Modelling, 1b. Strategy use and instruction, 1c. Feedback, 1d. Accommodations for language learning; 2a. Intellectual challenge, 2b. Classroom discourse, 2c. Text-based instruction; 3a. Representation of content, 3b. Connection to prior knowledge, 3c. Purpose; 4a. Behavioral management, 4b. Time management
Subject specificity: Subject specific (language arts)
Theoretical underpinning: PLATO is grounded in constructivist theories and instructional scaffolding through teacher modelling, explicit teaching of ELA strategies, and guided instruction

MQI (Hill et al., 2008); 1-4 point scale
Domains: 1. Common Core aligned practices; 2. Working with students and mathematics; 3. Richness of mathematics; 4. Errors and imprecision; 5. Classroom work is connected to mathematics
Examples of dimensions/subcategories (11 dimensions/subcategories): 1a. Students question/reason about mathematics, 1b. Students provide mathematical explanations, 1c. Cognitive requirements of tasks; 2a. Responses to students’ mathematical ideas, 2b. Teacher remediates student errors; 3a. Meaning-making and connections, 3b. Rich mathematical practices; 4a. Teacher makes content errors, 4b. Imprecision in language and notation, 4c. Lack of clarity (in content presentation); 5a. Time spent on mathematics/mathematical ideas
Subject specificity: Subject specific (mathematics)
Theoretical underpinning: MQI is grounded in constructivist views of learning together with teachers’ Mathematical Knowledge for Teaching (MKT)

TRU Math (Schoenfeld, 2014); 1-3 point scale
Domains: 1. The content; 2. Cognitive demand; 3. Equitable access to content; 4. Agency and identity; 5. Formative assessment
Examples of dimensions/subcategories (five broad dimensions): Accurate, coherent, justified mathematical content? An environment of productive intellectual challenge conducive to every student? Access of the lesson to all students? Students the sources of ideas and discussions? To what extent does instruction build on student ideas?
Subject specificity: Subject specific (mathematics)
Theoretical underpinning: TRU Math aligns with socio-constructivist views of learning and student-centered pedagogies

Despite the number of domains and dimensions (e.g., sub-dimensions) included in the different systems and manuals, it is often not made explicit why a dimension is included or not. Furthermore, neither the motivation nor the rationale behind the grouping of the different dimensions (and the clustering of items) is made explicit.

Klieme and colleagues (Klieme et al., 2009) developed what they call the Three Basic Dimensions framework (TBD), building on the TIMSS framework. Classroom Management, Cognitive Activation, and Student Support are the three key domains for analyzing teaching quality in this framework (which are again divided into 21 sub-dimensions). In the TBD framework, the quality of Classroom Discourse (e.g., the dimension Classroom Discourse), for example, is subsumed under the overall domain Cognitive Activation, while aspects of trust and tolerance in teacher–student and student–student interactions are treated as dimensions under the domain Student Support (Praetorius et al., 2018, p. 414). In the PLATO system (Grossman et al., 2013), Classroom Discourse is treated as part of the overall domain of cognitive demand (Disciplinary Demand in the PLATO vocabulary), as in the TBD framework. Classroom Discourse in PLATO refers to students’ opportunities to engage in content-related discussions with their peers and teacher. The CLASS (Secondary) observation system also divides teaching quality into three main domains (i.e., Emotional Support, Classroom Organization, and Instructional Support), which are again divided into 12 dimensions/sub-dimensions. In this observation system, classroom discourse is subsumed under the dimension Content Understanding and refers to students’ opportunities to interact with the teachers as a part of the overall domain Classroom Organization. Despite strong similarities in the domains and teaching practices captured, the ways in which the different domains and dimensions are grouped, listed, and conceptualized have severe implications for the empirical validity of, and the possible findings that can be drawn from, different studies using different observation systems. When findings are represented as aggregated scores for the overall domain (and not the individual dimension/sub-dimension), it becomes difficult to draw conclusions on the role of, for example, classroom discourse, as this might capture different facets of teaching across the different systems (e.g., classroom discourse as a part of students’ content-related talk in the TBD and PLATO frameworks while listed under the domain of classroom organization in the CLASS system). A related, but slightly different, dilemma is the question of terminology and how the different sub-dimensions are named. Across the systems (see Table 2 above), we see that similar domains (and dimensions), such as classroom environment, are named rather differently (e.g., classroom environment, classroom climate, classroom management, classroom organization) despite attempting to capture essentially the same phenomenon. Praetorius and Charalambous (2018) (see also Schlesinger & Jentsch, 2016) highlight this problem and show how the same constructs (e.g., domains) may be defined through different terms while similar terms (e.g., classroom discourse) may capture distinctly different things. Thus, we might argue that, without disregarding cultural nuances and contextual differences, “. . .agreeing on some common terminology would be one first and basic step to ensure that our research studies build upon each other” (Praetorius & Charalambous, 2018, p. 545). One step in such a direction could be to encourage observation system developers (and researchers) to specify what they mean by the different terms, supported by illustrative videos.
However, as pointed out by Bell and others (Bell et al., 2014; Fischer et al., 2019; Rowan & Raudenbush, 2016), challenges of terminology are recurrent dilemmas across comparative studies (linguistically, culturally, conceptually); thus, specifying terminology at the level of grain size (e.g., how discrete, targeted, and fine-grained practices are to be analyzed) could be a first step so that researchers (and readers) avoid inappropriate comparisons. As the field evolves, our next step could be to name similar behaviors with the same terminology. In addition, the comprehensiveness (e.g., amount and range) of practices and facets covered differs across systems and frameworks: Some frameworks are rather comprehensive and include a long list of dimensions and sub-dimensions (e.g., ICALT), while others concentrate on some key dimensions of teaching (e.g., CLASS, PLATO). Schoenfeld et al. (2018, p. 37) argue that a feasible and effective observation system must be parsimonious and concentrate on a small number of domains and sub-dimensions. They argue that, if the list is too long, it is hard to tease out what matters the most and what might be seen as sufficient versus necessary factors. Stigler and Miller (2018) make a fairly similar argument, underscoring the contextual nature of teaching, which results in skepticism toward the idea that teaching can be defined in terms of a list of decontextualized behaviors and “best practices.” As argued, no matter the number of domains and dimensions/sub-dimensions included in a system, it is often not perfectly clear why a dimension/sub-dimension is included or not. Likewise, the motivation or rationale behind the grouping of the different dimensions/sub-dimensions may not be made explicit. The grouping of the dimension Classroom Discourse – either as a part of Disciplinary Demand (e.g., PLATO, TBD) or listed as a part of the domain Classroom Organization as in CLASS – is an interesting illustration of this “lack of explicitness” of dimension groupings. To ensure translatability between observation systems, it would therefore be useful if instrument developers made these decisions more explicit and clarified the rationale for why a sub-dimension is listed within a specific domain. Again, making the rationale behind these decisions explicit would help researchers in the field build upon each other’s experience. This change would allow the field to accumulate common terminology and categories (when possible), thus moving away from fragmentation and idiosyncratic approaches when analyzing teaching.
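To illustrate why such grouping decisions matter in practice, the short sketch below represents two hypothetical observation systems as simple data structures and aggregates identical segment-level ratings into domain scores. The system names, dimension labels, and ratings are invented for illustration and do not reproduce any of the published manuals discussed above.

```python
# Toy illustration: the same dimension ("classroom_discourse") grouped under
# different domains in two hypothetical observation systems. Aggregating
# segment ratings to domain means therefore yields domain scores that are not
# directly comparable across systems, even with identical underlying ratings.
from statistics import mean

# Hypothetical systems: domain -> list of dimensions (not actual published manuals)
SYSTEM_A = {
    "cognitive_demand": ["intellectual_challenge", "classroom_discourse"],
    "classroom_environment": ["behavior_management", "time_management"],
}
SYSTEM_B = {
    "instructional_support": ["intellectual_challenge"],
    "classroom_organization": ["behavior_management", "time_management", "classroom_discourse"],
}

# Identical segment-level ratings on a 1-4 scale for one videotaped lesson
ratings = {
    "intellectual_challenge": [3, 4, 3],
    "classroom_discourse": [2, 2, 3],
    "behavior_management": [4, 4, 4],
    "time_management": [3, 4, 4],
}

def domain_scores(system, ratings):
    """Average the segment ratings of every dimension grouped under each domain."""
    return {
        domain: round(mean(r for dim in dims for r in ratings[dim]), 2)
        for domain, dims in system.items()
    }

print("System A:", domain_scores(SYSTEM_A, ratings))
print("System B:", domain_scores(SYSTEM_B, ratings))
```

Because the discourse ratings are averaged into different domains, the two systems report different domain-level profiles for the very same lesson, which is precisely the comparability problem described above.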

Theoretical Underpinning: Views of Teaching and Learning

As argued, observation systems embody a community of practice’s view of high-quality teaching and learning (Bell et al., 2019). Here, a community of practice could refer to a view of teaching and learning linked to different theoretical traditions, such as cognitive approaches to learning, socio-constructivist theories of learning, and behavioral learning theories, as well as to national and country-specific standards (e.g., national curricula, a country’s national teaching standards). Praetorius and Charalambous (2018) identified 11 theoretical underpinnings and research traditions when reviewing mathematics frameworks, spanning educational effectiveness theories, learning and teaching theories, subject-specific theories, motivation theories, and theories related to European didactics (e.g., the didactic triangle).

Luoto (2021) differentiates between cognitive theories, motivational theories, behavioral theories, socio-constructivist theories, and subject-specific theories, along with national standards and frameworks, when reviewing theoretical underpinnings across observation systems. Scholars’ views will, of course, vary from community to community, emphasizing different facets of teaching and learning. For example, several communities will consider cognitive activation to be a core aspect of high-quality teaching, while the degree to which teachers facilitate classroom discourse and encourage student participation might vary depending on the country’s cultural views of teaching and learning (Clarke et al., 2006a; Martinez et al., 2016; Stigler & Miller, 2018). Communities’ views encompass both group and cultural differences in the practices valued in different contexts. In the Nordic countries, for example, an observation instrument might privilege a high degree of student engagement as critical to high-quality instruction, while a US instrument might pay attention to explicitness of instruction (Klette & Blikstad-Balas, 2018). As indicated, communities’ perspectives on teaching quality can be located along a continuum that moves from a behaviorist view of teaching and learning to a cognitive view to a more socio-constructivist or situated view. Communities’ perspectives often blur the boundaries across this continuum. Depending on how thoroughly and explicitly the rationale of an instrument is documented, it can be difficult to determine what views underlie a specific observation system. Further, as underscored by Oser and Baeriswyl (2001) and Grossman and McDonald (2008), it might not be productive to dichotomize or oversimplify views of instruction, as doing so can lead to highlighting disparities in how communities define and label teaching (Grossman & McDonald, 2008) rather than focusing on how teaching and learning activities are nested and related. Observation manuals ascribing to the same theoretical tradition might nonetheless differ radically at the level of conceptualization, terminology, items, and key concepts of analyses. As a result, readers might struggle to recognize which parts of the theoretical grounding are considered and how these considerations are used and implemented at the level of domains (constructs) and dimensions (items). Here, the Classroom Discourse dimension could again serve as an example. The dimension of Classroom Discourse appears in several subject-specific manuals, such as PLATO and MQI, and generic manuals, such as CLASS, FFT, and TBD. While PLATO and CLASS link their theoretical groundings, respectively, to socio-constructivist and developmental theories of learning (Grossman et al., 2013; Pianta et al., 2008), the TBD framework (Praetorius et al., 2020) presents itself as originating from cognitive approaches to learning (Klieme et al., 2009; Praetorius et al., 2018). Learning theories are, however, seldom operationalized at the level of teaching theories (Oser & Baeriswyl, 2001), so readers might struggle to see how a specific dimension or item arises from or relates to a distinct theoretical tradition. This problem becomes even more challenging when the same dimension (e.g., Classroom Discourse) is listed and used to capture rather different facets of teaching across the different observation systems. For example, CLASS captures students’ opportunities to talk in general, while PLATO and the TBD framework privilege content-specific student talk when scoring for classroom discourse.
to talk in general, while PLATO and the TBD framework privilege content-specific student talk when scoring for classroom discourse. However, I would also underscore that, despite differences in theoretical underpinnings, some domains (e.g., Classroom Management, originating from behavioral theories) appear in almost all manuals and frameworks pursuing rather different theoretical ambitions (be they rooted in constructivist, cognitive, or developmental psychology). The point here is twofold. First, studies with similar theoretical perspectives do not necessarily share terminology, items, or a conceptual framework recognized at the level of domains and dimensions. As a result, the empirical categories and facets conceptualized and analyzed may differ substantially between frameworks belonging to the same tradition and pursuing similar theoretical goals. Second, because of discrepancies and inconsistencies in terminology and in the grouping and listing of concepts between the theoretical grounding and the empirical definitions of categories (e.g., domains and dimensions), it might be wise to follow the suggestions of Thomas (2007) and Hammersley (2012) and look more closely at the actual use of categories (what Hammersley (2012) calls "language games") when referring to the theoretical grounding of the different frameworks. A conceptual level closer to the actual analytical framework, focusing on terminology and levels of conceptualization, may provide a template for exploring how different domains and sub-dimensions delineate similar or different phenomena, how they process evidence and/or outcomes, and the extent to which these are consistent with higher-order theoretical domains and dimensions.

Subject Specificity

Scholars have agreed that subject-matter specificity is important when measuring teaching quality (Baumert et al., 2010; Seidel & Shavelson, 2007), but they disagree on how to capture these facets of teaching quality. Researchers have developed several observation systems to evaluate subject-specific practices, such as the Mathematical Quality of Instruction (MQI), the Quality of Science Teaching (QST; Schultz & Pecheone, 2015), and the Protocol for Language Arts Teaching Observations (PLATO). The MQI observation system, for example, has a targeted focus on content, such as the "richness of mathematics," student participation in mathematical reasoning, and the clarity and correctness of the mathematics presented in the class. Other systems are generic, designed to capture key aspects of teaching held to be critical for student learning across subjects and classes. The Classroom Assessment Scoring System (CLASS; Pianta et al., 2008) is an example of such a system, as are the Framework for Teaching (FFT; Danielson Group, 2013), the International System for Teacher Observation and Feedback (ISTOF) observation system (Teddlie et al., 2006), and the International Comparative Analysis of Learning and Teaching (ICALT) system (Van de Grift, 2007). Internationally, several scholars have claimed a need for subject specificity when analyzing the qualities of classroom teaching and learning. Hill and Grossman (2013) argue that classroom analysis frameworks must be subject-specific and
involve content expertise if they are to achieve the goal of supporting teachers in improving their teaching. This approach, they continue, would provide teachers with information that is relevant for situation-specific teaching objectives, whether these concern algebra learning, student participation, or problem solving. Drawing on teachers' knowledge data, Blömeke and colleagues (Blömeke et al., 2016) demonstrated that a combination of generic factors and subject-specific factors (in mathematics in their case) is required to produce valid knowledge about how different teaching factors contribute to student learning. Klette, Roe, and Blikstad-Balas (2021) use the PLATO framework (targeting language arts education) to capture both subject-specific and generic aspects when analyzing features of Norwegian language arts and mathematics instruction. Schoenfeld et al. (2018) underscore that subject-specific systems and manuals in mathematics capture students' engagement in disciplinary thinking and inquiry methods much more precisely than generic systems do. However, they also showed that disciplinary thinking is credited differently across frameworks, depending on how the rubrics define and value different aspects of engagement with disciplinary content. Thus, what counts as rich instruction (in mathematics) is not unified and varies across observation systems, including subject-specific mathematics systems. Praetorius and Charalambous (2018) make a similar point when investigating subject-specific, generic, and "hybrid" manuals in mathematics. They argue that these systems cover rather similar aspects of instruction (subject-specific frameworks included) but also vary widely in how they decompose and conceptualize qualities of mathematics instruction. In a recent study combining a generic manual (CLASS) and a mathematics-specific, Common Core-aligned manual (the Instructional Practice Research Tool for Mathematics, IPRT-M), Berlin and Cohen (2020) argue that both approaches are needed, emphasizing that consistent Common Core-aligned mathematical engagement only occurred in classrooms that were orderly, safe, productive, and emotionally supportive. Comparing generic and subject-specific manuals therefore highlights the strong degree of similarity across these manuals in terms of the domains and dimensions captured (Klette, 2022), scoring procedures (Bell et al., 2019; Martin et al., 2021), and scales of analyses (T. Kane et al., 2012; White, 2021). After systematically examining 11 manuals used in mathematics (four generic, three subject-specific, and four hybrid manuals), Praetorius and Charalambous (2018) claim that there were more similarities than differences across these systems (CLASS, TBD, ISTOF, and DMEE, the Dynamic Model of Educational Effectiveness (Kyriakides et al., 2018); IQA (Boston & Candela, 2018), MQI (Hill et al., 2008), and M-Scan (Walkowiak et al., 2018); TEDS-Instruct (Schlesinger et al., 2018), TRU Math (Schoenfeld, 2014), UTOP (Walkington & Marder, 2018), and MECORS (Lindorff & Sammons, 2018)). They seriously question the fruitfulness of the distinction between generic and subject-specific frameworks and discuss whether both could be replaced by thoroughly validated and comprehensive generic frameworks, supplemented with targeted subject-specific frameworks (Charalambous & Praetorius, 2020).
Hill and Grossman (2013), in contrast, argue that the strength and level of precision developed in subject-specific manuals are necessary when aiming to analyze teaching quality.
The MET study (Kane et al., 2012), which measured teaching quality in 3,000 US classrooms using five different manuals (three subject-specific: PLATO, MQI, QST; and two generic: CLASS, FFT), found no large differences across the manuals deployed. There is probably not one right solution to the question of whether to use generic or subject-specific observation systems. Instead, the answer will depend on the purpose of the study, be it evaluation, professional development, and/or international comparison. However, we might argue that it would be useful to build on existing systems rather than developing yet another one when setting out to observe facets of teaching quality, be they generic or subject-specific.

Grain Size Related to the level of subject specificity, researchers have also addressed the issue of grain size, or how discrete practices are to be analyzed (Hill & Grossman, 2013) within the scope of teaching facets captured. For decades, observation studies have addressed this issue (for early work here, see for example Brophy & Good, 1986; Flanders, 1970). In some newer systems, such as CLASS and PLATO, consensus has been reached on a set of core dimensions (12 for both CLASS and PLATO). This stands in contrast to earlier systems that included a long list of teaching facets to score (Scheerens, 2014). Thus, and as already argued, the number of domains and dimensions to be scored vary across systems. A system’s grain size may, in addition, be related to the number of scale points when measuring a teaching facet, such as the presence of a lesson objective (e.g., purpose). Such a facet might be rated on a four-point scale, a seven-point scale, a dichotomous scale (present or not present), or a combination of scales. However, the number of scale points is not necessarily an indicator of score quality (e.g., reliability, variation across lessons, and segments scored). Matters of score quality are best addressed through a compelling argument that relies on multiple sources of validation evidence following M. Kane (Kane, 2006; see also Gitomer, 2009). Scholars in the field of survey approaches have argued that a stretched scale (a seven-point scale, for example) aligns with higher reliability and validity (Gehlbach & Brinkworth, 2011, Krosnick & Presser, 2010). How this resonates with the use of scales in observation rubrics is not well documented, and several of the observation rubrics referred to in this overview use either a four- or a five-point scale when scoring. To my knowledge, the CLASS observation system is the only one using a seven-point scale. Yet another aspect of grain size and scoring is the balance between the duration and frequency of activity (quantity) and assigned qualities of that activity. In many rubrics, these two aspects go hand in hand and are not separated. For example, to get a high score on a dimension like Classroom Discourse, it is not enough to score students’ opportunity to engage in content-related talk with their peers or the teacher. Achieving a high score should also require that the talk continues for a sustained period (more than 5 min of a 15-min segment, for example). Conversely, for facets
such as purpose or clarity of goals, the presence rather than the duration (and frequency) is the main quality criterion. For this reason, some scholars (Humphry & Heldsinger, 2014; White, 2018) have proposed using several scales (e.g., present/not present, 1–3, 1–4, 1–5) within the same framework. In addition, researchers have discussed the distinction between measuring "opportunities offered" (by the teacher) and "opportunities used" (by the student) while examining issues of grain size and scoring procedures. Few observation systems make such a distinction; most capture both aspects without separating them, although the TBD framework is a distinct exception. Thus, a problem arises if we capture the teaching/learning opportunities offered by the teacher and assume they equal the learning opportunities used by the students. However, several manuals have students' use of the opportunity as a criterion for a high score. For example, the PLATO manual requires observers to see evidence "that the feedback helps student in their activity" (Grossman, 2015, Rubric Feedback) before giving a high score (4) for the Feedback dimension. The decision of whether to separate these two aspects of teaching/learning opportunities needs further elaboration and empirical testing. Like Praetorius and Charalambous (2018), I agree that quantity (frequency and duration) and quality are both important aspects of teaching quality. Furthermore, we might be naïve if we think that offered opportunities are the same as used opportunities (for more here, see section "Focus on Students or Teachers (or Both)" below). Whether to score the whole lesson or segments of the lesson is yet another aspect of grain size. One might imagine that observation systems seeking to code smaller grain sizes (i.e., narrower teaching practices) would segment the lesson many times so that narrow behaviors can be accurately documented throughout a lesson (e.g., MQI). Alternatively, observation systems using more holistic codes requiring the rater to judge multiple interrelated practices might segment at larger intervals (e.g., 30 min or a whole lesson) so that the ratings reflect all of the interrelated practices (e.g., ICALT). The MQI and TRU Math observation systems combine targeted measures based on shorter segments (7.5-min and 15-min segments) with overall, more holistic scores based on the whole lesson. Drawing on the MQI manual to measure mathematics classrooms in Norway, Finland, and the United States, Martin et al. (2021) problematize the weak relations between the holistic scores and the scores based on the 7.5-min segments. The question at stake here is how sequencing of the lesson might help raters to score the lesson more accurately and reliably, assuming that dividing the lesson into smaller segments makes the scoring process more precise, manageable, and thus reliable. The issue of "rater error" (Gitomer et al., 2014; White, 2018), or whether raters score lessons accurately and consistently over time, has been debated. White (2018) argues that producing reliable measures requires both high standards for rater performance and evidence of how raters live up to these standards.
Returning to the issue of weak coherence between narrow scoring based on 7.5-min segments and holistic scoring based on the whole lesson, we might argue that current procedures for rater reliability are insufficiently specified in terms of the standards set for rater performance, how raters are held to those standards, and the potential reciprocal relation between the two scoring logics.
Decisions about what grain size to capture are further shaped by the rhythm and pace of instruction. Teaching facets are not always equally probable in every segment of a lesson. For example, while instructional purpose may be central to the beginning of a lesson, it may be less central toward the end of the lesson. The degree of lesson segmentation necessary for a specific grain size of practice being scored is a decision made by system developers, but it often remains undocumented (Klette et al., 2017). How sequencing and ways of segmenting lessons impact scoring results thus calls for further systematic investigation.
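
To make the interplay between scale types and the quantity/quality balance discussed above concrete, the sketch below (in Python) shows one way such segment-level rules could be encoded. The facet names, the four-point and dichotomous scales, and the more-than-5-of-15-minutes threshold are taken from the examples in this section; the data structure and functions are hypothetical illustrations, not the scoring logic of any actual manual.

from dataclasses import dataclass

@dataclass
class Segment:
    # One 15-minute observation segment (an illustrative record, not a real manual's data model).
    discourse_quality: int     # 1-4 rubric judgment of content-related student talk
    discourse_minutes: float   # minutes of sustained student talk observed in the segment
    purpose_stated: bool       # dichotomous facet: lesson objective made explicit or not

def score_classroom_discourse(seg: Segment) -> int:
    # Combine quality and quantity: a top score requires both a high rubric judgment
    # and sustained talk (here, more than 5 of the 15 minutes).
    if seg.discourse_quality >= 4 and seg.discourse_minutes > 5:
        return 4
    # Cap the score when the talk was too brief, however good it was.
    return min(seg.discourse_quality, 3)

def score_purpose(seg: Segment) -> int:
    # Presence-based facet: duration is irrelevant; only whether the objective appeared counts.
    return 1 if seg.purpose_stated else 0

segments = [Segment(4, 7.0, True), Segment(3, 2.5, False)]
print([(score_classroom_discourse(s), score_purpose(s)) for s in segments])  # [(4, 1), (3, 0)]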

Scoring Specifications

Classroom observation systems differ in their scoring procedures, the role of master raters, and the preparation and training of raters. The choices made by developers for these three aspects influence the reliability and validity of the observation scores. I discuss each of them below. Scoring Procedures. Observation systems differ in how rating procedures and scoring rules are carried out. The number of observations and segments to be scored, the degree to which lessons are double rated, and whether ratings are checked systematically by master raters for accuracy are only some of the rating procedures that are relevant to the validity of the system. Scoring rules concern how ratings are aggregated across units (e.g., segments, lessons, teachers) and across raters (e.g., averaging discrepant ratings, taking the highest rating), as well as issues such as rounding scores, dropping ratings, and various scoring models (e.g., averaging ratings across segments and lessons to the teacher level, using Item Response Theory (IRT) models to create teacher scores). Lessons from the MET study suggest that at least four occasions/lessons per teacher and multiple (at least two) raters are required to reach a reliable teacher-level estimate on the observation systems in that study. Hill et al. (2012b) argue that three lessons per teacher are sufficient for a reliable estimate of teaching quality in mathematics, drawing on the MQI manual. Testing out the PLATO manual, Cor (2011) suggests that one must observe at least five 15-min segments per teacher to obtain an overall reliability greater than 0.80. Baumert and Kunter (2013) and Praetorius et al. (2014) underscore, in addition, that the number of segments to be scored depends on the domain and teaching facets scored. For example, they argue that three or four lessons in each classroom are sufficient for the Classroom Management facet, while measuring Cognitive Activation requires a minimum of nine lessons from each classroom. In the recent GTI video study (OECD, 2020), the researchers ended up videotaping two lessons per classroom (N = 85 classrooms per country) in each of the eight countries, arguing that two lessons were sufficient to identify country-level relationships between teaching and learning in this specific content area (quadratic equations; OECD, 2020). Newton and colleagues (2010) claim that observing five lessons per teacher is needed for assessing secondary mathematics teaching. Praetorius et al. (2014) distinguish between general aspects of teaching, such as classroom management and learning support, and content-related aspects of teaching, such as cognitive
activation. For the two first elements (i.e., classroom management and learning support), Praetorius et al. (2014) argue that one lesson per teacher was enough, while they argue (like Baumert & Kunter (2013) that nine lessons were necessary to obtain reliable measures for cognitive activation. As can be seen from this short summary, there is no gold standard here, and the number of lessons or segments required will vary based on (a) the domain of teaching measured, (b) the underlying variation of teaching practice in the sample, (c) the level of the claim anticipated (i.e., teacher, school, country), and (d) how well raters are trained. For the case of rater reliability or double rating, the general advice has been to double rate between 15% and 20% of observations (Cohen & Grossman, 2016; Creswell et al., 2016) with a threshold for interrater agreement between 70% and 92%. However, studies have highlighted high interrater agreement on some facets (e.g., Classroom Management) and rather problematic agreement linked to highinference facets like Classroom Discourse and Strategy Instruction. The MET study suggest at least two raters assigned to the same classroom and that each rater should not rate more than two lessons/days per teacher, showing how the reliability of measurement can be increased through the addition of either more observation days and/or the addition of more raters on a given day. Preparation of Raters: Master Rating Procedures. Teaching and learning interactions occur in multiple modalities; for example, they involve sights, sounds, movement, and words with actual people over time. Observation systems codify those multidimensional interactions onto a sheet of paper with words and video records that “show” what is meant by the words. However, as “frequent” questioning may be one question every 5 min for one observer, but one question every 8 min for another, rubrics need to clarify in detail how the term frequent questioning is to be interpreted. Given the near infinite complexities associated with specifying the meaning of words that comprise observation systems, master raters are often used to define the “gold standard” or “correct” interpretation of the system. Master raters are deemed to know and be able to apply the observation system. They are often used to create master scores for training materials, such as certification videos and calibration exercises. To my knowledge, no studies have examined master raters’ accuracy, although one study compared master raters’ reasoning to that of less expert raters (Bell et al., 2012). In addition, one study found that master raters’ certification rates were similar to or higher than raters’ certification rates (Bell, 2021b). While neither study provided evidence that master raters agree with one another at higher levels than raters, the qualitative study of rater thinking suggests that master raters are more likely to use reasoning that references the rubric than are other raters. This area is ripe for additional research. Preparation of Observers. It is critical that raters are able to create accurate and unbiased scores across teachers and classrooms. Whether they are using predefined observation systems or developing their own coding system or manual, raters thus need some type of training. Raters are usually trained using manuals that provide insight into the theoretical basis of the system, the meaning of the items and scales, and the scoring rules. 
Training can also provide raters with opportunities to practice by observing videos and scoring them during the training. Certification of raters
could be required, as well as recertification after a specific time period. Required training and certification vary by observation system. Some observation systems require training and certification (e.g., the CLASS manual and the PLATO manual; online versions available) as well as recertification (e.g., the CLASS manual), while others have no such requirements (e.g., FFT, ISTOF, ICALT). For the GTI video study, the researchers developed their own coding manual (Bell, 2021a), backed by systematic training and certification (Bell, 2021b) and including a team of master raters (see also Tremblay & Pons, 2019). The MET study used five different manuals (i.e., CLASS, FFT, MQI, PLATO, and UTOP (UTeach Observation Protocol)), and technical reports suggest a high degree of convergence across the different instruments, while interrater agreement was generally described as low despite extensive rater training and monitoring (Kane et al., 2012; Rowan & Raudenbush, 2016). Across all instruments, interrater agreement was, however, generally higher when items required compressed scoring (e.g., present/not present). White and Klette (2021) discuss differences in rating between trained and non-trained raters using the PLATO manual. Their initial findings suggest substantial differences between non-trained and trained raters, especially with regard to accuracy and the use of empirical evidence when scoring. Calibration As a Way of Strengthening Interrater Reliability. Researchers have suggested systematic calibration procedures as a means of strengthening interrater reliability (Bell, 2021a; Joe et al., 2013; White & Ronfeldt, 2020). Since raters' capacities are limited and their interpretations are likely to drift apart, researchers have suggested systematic monitoring and follow-up support (i.e., calibration) to help raters use the scales accurately and reliably and to ensure they stay close to the rubrics. As shown, scoring procedures, requirements, and specifications vary across studies. In recent studies, and especially large-scale comparative studies, a stronger emphasis has been put on rater training, calibration, and monitoring in order to obtain reliable scores. Gitomer et al. (2014, p. 2) argue, however, that "...preliminary evidence from a handful of large-scale research studies underway in 4–10th grade classrooms suggests that although observers can be trained to score reliably, there are concerns related to initial training, calibration activities designed to keep observers scoring accurately over time, and use of observation protocols".
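
As a schematic illustration of the bookkeeping these scoring rules imply, the sketch below (in Python) averages discrepant ratings within double-rated lessons, aggregates lessons to teacher-level scores, computes exact and adjacent (within one scale point) agreement on the double-rated subset, and uses the Spearman-Brown formula to project how reliability grows when more lessons are added. The data, the single-lesson reliability of 0.45, and all function names are invented for illustration; actual studies rely on more elaborate models (e.g., IRT or generalizability theory) and on the system-specific rules described above.

from collections import defaultdict
from statistics import mean

# Ratings keyed by (teacher, lesson, rater), scored on a 1-4 scale (invented data).
ratings = {
    ("t1", "lesson1", "r1"): 3, ("t1", "lesson1", "r2"): 3,   # a double-rated lesson
    ("t1", "lesson2", "r1"): 2,
    ("t2", "lesson1", "r1"): 4, ("t2", "lesson1", "r2"): 3,
}

def teacher_scores(ratings):
    # Average discrepant ratings within each lesson, then average lessons to the teacher level.
    per_lesson = defaultdict(list)
    for (teacher, lesson, _rater), score in ratings.items():
        per_lesson[(teacher, lesson)].append(score)
    per_teacher = defaultdict(list)
    for (teacher, _lesson), scores in per_lesson.items():
        per_teacher[teacher].append(mean(scores))
    return {teacher: mean(scores) for teacher, scores in per_teacher.items()}

def double_rating_agreement(ratings):
    # Exact and adjacent (within one scale point) agreement on double-rated lessons.
    per_lesson = defaultdict(list)
    for (teacher, lesson, _rater), score in ratings.items():
        per_lesson[(teacher, lesson)].append(score)
    pairs = [scores for scores in per_lesson.values() if len(scores) == 2]
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

def spearman_brown(single_lesson_reliability, n_lessons):
    # Projected reliability of a teacher-level score based on n_lessons parallel lessons.
    r = single_lesson_reliability
    return n_lessons * r / (1 + (n_lessons - 1) * r)

print(teacher_scores(ratings))            # {'t1': 2.5, 't2': 3.5}
print(double_rating_agreement(ratings))   # (0.5, 1.0)
print(round(spearman_brown(0.45, 4), 2))  # 0.77: more lessons, higher projected reliability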

Focus on Students or Teachers (or Both) Depending on the focus of the scoring procedures, observation systems might require raters to pay attention to the practices, utterances, and behaviors of teachers, students, or both. Most frameworks focus on teachers’ behaviors and practices (and supporting activities) while often simultaneously paying attention to students’ behaviors and practices/activities. To achieve high scores on the PLATO manual, for example, raters need evidence for students’ active engagement. To get a high score (4) on the element Classroom Discourse, raters must have evidence that “the majority of the students participate by speaking and/or actively listening”; similarly, for a high score on the Feedback element, raters need to see “that the feedback helps
student in their activity” (Grossman, 2015, Rubrics Classroom Discourse and Feedback). Purely “teacher-centered” instruction would not receive high scores in manuals like PLATO, CLASS, and TBD. The abovementioned ICALT manual (Van de Grift, 2007) privileges teacher behavior especially. As teachers or students seldom engage in stand-alone activities but take part in a chain of interactions and interlinked relationships, scholars conducting classroom research need to situate their analyses in a larger landscape, which often includes all aspects of the didactic triad (i.e., teachers, students, and content). To analyze learning from students’ perspectives, one most often needs to include the teacher’s activities and utterances, as well as those of the other students. Similarly, as content cannot be analyzed alone but moves at the intersection between the three key elements – the students, the teachers, and the content involved – manuals need to include the focal content in their analyses. However, privileging analyses of the content through the lenses of communication presents a danger of reducing content learning to interaction and interaction patterns, Hammersley (2012) and Klette (2007) claim. Looking across frameworks, we might argue that the field might profit from designing instruments with an explicit focus on student actions, either as a related (student) instrument or as a separate part of the teacher instrument. The TBD framework (Praetorius et al., 2018) offers such an opportunity and aids raters in scoring teachers’ and students’ activities through separate but aligned scoring procedures.

Empirical Evidence: Connecting Teaching with Student Outcomes The validity of the content of the observation system will probably vary. As mentioned in the section on the dimensions of teaching captured, the assumption is that the dimensions of teaching included in observation systems reflect teaching quality. A critical criterion for teaching quality is how much students learn. Thus, it is important to understand the extent to which the assumed relation between the teaching quality indicators and student learning has been confirmed empirically. In other words, we must examine the nature and quality of the research upon which the indicators were based. Empirically, this is often examined by testing the degree to which scores from a particular observation system, which includes specific dimensions of teaching, are associated with student outcomes (e.g., Decristan et al., 2015) or statistically derived measures of teaching quality, such as value-added models (e.g., Bell et al., 2012; T. J. Kane et al., 2013). Despite the desire to use predictive validation studies as the gold standard of empirical evidence, such studies face many problems, including confounds to causal mechanisms, inadequate accounting for prior learning, other school factors that shape teaching and learning (e.g., curriculum, content coverage), and inappropriate outcome measures, just to name a few. While predictive evidence is important, M. Kane (2006) argues that we must consider the validity of any system in the form of a clear validity argument. Such an argument specifies the inferences necessary to move from specific observation ratings to inferences about the sample studied (e.g., the quality of teaching in a given time frame with a specific group of students), all the way to inferences at the
domain level (e.g., all of a teacher’s teaching in a given year with all the students taught). Like Stigler and Miller (2018), Gitomer and colleagues (Gitomer, 2009; Gitomer et al., 2014) note how the correlation between observation scores and other measures of instructional quality has been low (see also Kane et al., 2012; GTI video study (OECD; 2020) and Klette et al., 2021). Stigler and Miller (2018) question the fruitfulness of variable-oriented approaches when investigating features of teaching. Rather than testing and calibrating an endless list of variables to capture qualities in teaching, they propose, as mentioned, three broader areas (i.e., productive struggle, explicit connections, and deliberative practices) as lenses for systematically and comparatively investigating teaching across sites and subjects.
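
As a bare-bones illustration of the predictive-validation logic described in this section, the sketch below (in Python) simply correlates teacher-level observation scores with class-average achievement gains. Actual validation studies rely on value-added models with extensive controls for prior learning and school factors; the data here are invented and the code is illustrative only.

from statistics import mean

# Invented teacher-level data: observation score (1-4 scale) and class-average achievement gain.
obs_scores = [2.1, 2.8, 3.0, 3.4, 3.7]
gain_scores = [0.05, 0.02, 0.10, 0.12, 0.15]

def pearson(x, y):
    # Pearson correlation between two equal-length lists.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson(obs_scores, gain_scores), 2))  # 0.81 with this invented data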

Ethics

Ethics permeate large-scale studies, and especially video studies, on several levels, including sampling and data gathering, analyses, and the way we (re)present the data. A recurrent issue in video studies is the problem of anonymization. Video data is not anonymous, and ethical considerations should be a part of the whole workflow, from initial planning to how to display key results and findings. Consent forms, for example, must provide information to participants about the what (the purpose of the study), the who (who will use/have access to the data), and the where (where the data will be stored and for how long). These considerations require the researcher or research team to plan for the whole workflow – what Klette (2019) terms "Ethical by Design." Besides being a part of the theoretical framing (e.g., ethnocentrism and the intrusion of theoretical assumptions privileging specific aspects of teaching), ethics are at play when gathering the data (e.g., camera angle, audio), when analyzing the data (e.g., theoretical perspectives, how to keep track of nicknames), and when presenting the data, which must be done without harming informants while safeguarding their privacy and confidentiality. In the next section, I turn to how developments in technology (and social media) have paved the way for increased interest in and use of video capturing in large-scale assessment studies.

Technology: A New Generation of Video Studies

The growing interest in video designs can be traced to the rapid development of technology that allows easy capture, storage, and online streaming. Video recording equipment is now miniaturized and portable, and it can be remotely controlled and operated by individual researchers or teachers themselves, thus making such studies feasible and less intrusive on the everyday life of classrooms (for an overview of reactivity in video studies, see Lahn & Klette, 2022). New technologies in this field have been paralleled by major developments in coding and processing instruments,
software for analyzing video data (e.g., Studio Code, Interact, Observer XT), and systems and infrastructure that facilitate the sharing of data as well as targeted and integrative analyses. One of the benefits of video data/video capturing from classrooms is that it enables analyses that can combine the subject-specific and generic features of teaching and learning, making it well suited to the integrative ambitions linked to comparative didactics (Meyer, 2012). Video data also provide opportunities to combine different analytical and theoretical approaches to the same data set. For example, Berge and Ingerman (2017) combined variation theory and conversation analysis to understand the features of science teaching and learning among bachelor students. Likewise, Ødegaard and Klette (2012) combined process/product approaches to teaching and learning (e.g., instructional format and activity structures) with subject-specific dimensions (e.g., conceptual language used, quality of explanations) when analyzing science teaching in secondary classrooms in Norway. Recently, Praetorius and Charalambous (2018) used the same video data set (i.e., three elementary mathematics lessons from the National Center for Teacher Effectiveness [NCTE] video library at Harvard) to test out different observation manuals, all aimed at capturing aspects of mathematics instruction. The purpose was to systematically check for possible synergies and complementarities among these frameworks. Videos further enable researchers to test how the sequencing of lessons and the use of time sampling impact empirical validity. Some frameworks divide the lesson into 7.5-min or 15-min segments when scoring, while other frameworks base their assessment on the whole lesson. Video capturing also enriches our methodological sensitivity to grain size: how the different frameworks identify and parse out teaching practices, how these practices are conceptualized at the level of operationalization and rubrics, how targeted versus holistic teaching practices are measured, and the level of subject specificity involved. To summarize, the use of video capturing has contributed to strengthening methodological rigor, reliability, and validity in classroom studies and has provided a space for productive dialogues between different research approaches and theoretical perspectives.

Moving Forward

Research in this field has made considerable efforts and improvements in conceptualizing, operationalizing, and measuring teaching quality using observation manuals for large-scale studies. Although the field has reached a certain consensus on issues related to how to conceptualize teaching quality (e.g., the teaching dimensions captured), the scoring and training specifications required, and the number of lessons necessary for a valid evaluation, additional work remains to determine (a) how accurately and precisely observation systems portray teaching quality, (b) how to use these (observation) systems for a "theory of action," (c) the instruments' potential
and merits for teacher learning and improvement, and (d) how to handle rater errors. For the latter (e.g., rater error), we especially need to know more about how to measure teaching quality for the purpose of developing rigorous and fair teacher assessment systems drawing on classroom observations. Cohen and Goldhaber (2016) note that observation systems have not been put under the microscope for systematic investigations, despite the increased proliferation and popularity of classroom observations. This comment is especially relevant for the purpose of rating and scoring. For example, some studies have suggested that raters are the largest source of error in the context of research studies using observation systems. These studies have pointed to a large degree of variability among raters (Casabianca et al., 2015; Kelly et al., 2020), inconsistency and “drift” within and among raters (Bell et al., 2014; Hill et al., 2012a), and differences in ratings when applying the instruments to teachers the raters personally know or not (McClellan et al., 2012). (For a more thorough discussion on rater error, see Cohen and Goldhaber (2016), Gitomer (2008), Gitomer et al. (2019), Kelly et al. (2020), Martinez et al. (2016), White (2018), and White and Ronfeldt (2020)). As discussed at the outset of this chapter, video observations together with observation manuals as a means of measuring teaching quality can be used for several purposes. So far, I have discussed challenges and issues when using video observations in the context of international large-scale research studies. In this last section, I will touch upon the possibilities and challenges when using international large-scale video capturing for teacher evaluation and teachers’ professional learning. In addition, I briefly discuss how large-scale video studies could serve as a means to drive teacher change and empower longitudinal classroom studies.

Teacher Evaluation As discussed in the introduction, few international large-scale teacher evaluation studies have drawn on classroom video data as performance measures. In one study, Martinez, Taut, and Schaaf (2016) drew on a purposively selected sample of 16 observation systems in 6 countries (Singapore, Japan, Chile, Australia, Germany, and the United States) to summarize how classroom observations might serve as performance measures in teacher evaluations. Distinguishing between conceptual, methodological, and policy aspects of the observation systems, they highlight convergence in the overall purpose across the systems, but they identify considerable divergence in the degree of standardization of the observation process, how the systems conceptualized good teaching, and how this observation information is used. Looking across the different designs and observation systems, they argue how researchers and policymakers might use the gathered information to reflect on options available to make more informed decisions when using classroom observation for the purpose of evaluating teaching. Conversely, national teacher evaluation systems have been developed to include classroom videos from large samples as a means of “objective” evaluation of teacher performance. In the United States, for example, administration and principals have been using classroom video capturing to evaluate teacher performance based on
standardized observation instruments, such as FFT (Danielson Group, 2013) and CLASS (Hamre et al., 2013). As argued above, variability and low interrater agreement in scoring, combined with observation manuals' lack of capacity to differentiate between less and more effective teachers (Bell et al., 2014; Casabianca et al., 2013; Kelly et al., 2020; Kraft & Gilmour, 2017), are major impediments when using video observation for the purpose of teacher evaluation. For example, Kraft and Gilmour (2017) discuss how differences in school cultures might impede principals' capacity to accurately assess teachers in their schools; to illustrate this issue, Kraft and Gilmour noted that rather few teachers are rated below the "proficient" level. For the case of consequential teacher evaluation, Barrett et al. (2015), drawing on data from North Carolina, report that raters tended to rate a disproportionate number of teachers just above the proficiency threshold. Similarly, Grissom and Loeb (2014) investigated teacher evaluation systems in Miami and report on principals' tendency to be lenient on the observation scores of teachers they worry may be put under sanctions based on their evaluation. This evidence suggests that administrators and principals tend to evaluate on the basis of face-valid indicators rather than accountability indicators, representing a "leniency bias" in many teacher evaluation systems (Rowan & Raudenbush, 2016). Summarizing evidence from recent decades of teacher evaluation drawing on data from teachers' enacted classroom practices, Rowan and Raudenbush (2016) argue that the number of risks and distortions associated with teacher evaluations drawing on classroom observation data makes them unlikely to portray a "fair picture" for the consequential evaluation of teachers. Rather, classroom or video observations should be used for school development purposes by exploring how cycles of observation, assessment of student learning outcomes, and feedback "... can be structured to produce not only individual improvement among individual teachers, but also school-wide efforts at instructional improvement" (Rowan & Raudenbush, 2016, p. 1209). This latter approach to teacher evaluation has been a part of a national quality system intended to improve teaching in Chile. In 2003, Chile implemented a national, standards-based, multi-method, mandatory teacher evaluation system known as the National Teacher Evaluation System. The Chilean system combines formative and summative assessment using four distinct instruments (Taut & Sun, 2014): a structured portfolio (including video capturing of classroom teaching), a peer interview, a supervisor/principal questionnaire, and a self-assessment. Looking across the different instruments, Taut and colleagues (Taut et al., 2012; Taut & Sun, 2014) argued that portfolios – and especially video capturing of classroom teaching – showed valid predictive power for student achievement scores, particularly with regard to teaching facets such as lesson structure, time on task, student behavior, and student evaluation materials (Santelices & Taut, 2011). Despite controversies when introducing such a national system, research on perceived consequences (Taut et al., 2011), reported, for example, from the point of view of school leaders, has suggested that positive implications outweigh negative effects over time.
Positive effects are especially reported in the form of increased teamwork and awareness around teaching practices, together with internal reflection processes drawing on the assessment data. However, Taut and colleagues argue that the impact on individual teachers' development dominates over institutional effects (Taut et al., 2011).
Recently, initiatives have been taken to introduce teacher evaluations drawing on classroom video in the United Kingdom (Office for Standards in Education, Children's Services and Skills [OFSTED], 2018) and the Netherlands (Dobbelaer, 2019; Scheerens, 2014). However, as noted by Martinez et al. (2016) and Rowan and Raudenbush (2016), strong arguments exist for using such data for the purpose of teachers', schools', and districts' continuous improvement rather than for high-stakes individual evaluations of teachers. Martinez et al. (2016) argue, for example, that these data could be used for longitudinal purposes, monitoring key facets of teaching and changes in teaching practices over time in addition to providing information about individual teachers' performance.

Video Documentaries As Longitudinal Data Video documentation provides accurate representations of teaching practices from a given time. When used over a longer time span, video capturing provides accurate and detailed information about trends and developments in teaching and learning practices within a given time frame. For example, video capturing from lower secondary mathematics classrooms in Norway from 2005 to 2018 indicate changes in teachers’ use of activity formats, suggesting a notable increase in group work and peer work (from 4% to 20%) in mathematics during this period alongside a decrease in whole-class instruction and individual seat work (Klette, 2020). However, few studies have used video capturing for the purpose of longitudinal analyses of teachers’ change and/or reform and curriculum implementation. When drawn from a larger, carefully sampled data set, classroom videos might offer valuable information on how teachers enact and react to different reform initiatives and curriculum changes. Further, they might provide us with precise and time-sensitive information on how teachers develop their practice over time. In the current Teaching Over Time (TOT) project, White et al. (2021) capture teacher change over a period of 8 years (2010–2018) by examining video archive data from the MET study (videotaped in 2010) alongside new data from the same classrooms (TOT data videotaped in 2018). Capitalizing on the MET sample, White et al. (2021) use two cycles of video data to explore possible shifts in instructional practices after the introduction of the Common Core State Standards (CCSS) in 2010 in the United States. The goal of the TOT project is to explore how the changes promoted by the CCSS were implemented at the classroom level and if the quality of instruction provided to students had improved. Thus, the TOT project used classroom observations to monitor how a specific national policy (e.g., the CCSS) might shift instructional classroom practices within a specific time period.

Videos As a Means of Improving Teachers' Professional Learning

A third benefit of video capturing is that these approximations to practice might serve as lenses into possible variations in teaching practice within and across
countries that could function as a reservoir for teacher learning. Exemplary videos from a larger data set can be sampled to serve as possible repertoires of specific teaching practices, such as how teachers represent or introduce a content area, use questions, organize a classroom discussion, and stimulate student inquiries. Studies have demonstrated that teachers and student teachers may capitalize on video representations of teaching to reflect on and confront their implicit theories of teaching with alternative strategies (Beisiegel et al., 2018; Van Es, 2012). The recent MET (BMGF, 2012) and GTI (OECD, 2020) studies developed repositories known as the Measures of Effective Teaching Longitudinal Database and the Global Teaching InSights Library in addition to their primary research ambition of using video capturing to measure teaching quality (see https://www.icpsr.umich.edu/web/pages/about/metldb.html and https://www.oecd-ilibrary.org/education/global-teaching-insights_20d6f36b-en). Both BMGF and OECD argue that video clips from carefully sampled classrooms can serve as a means of discussing possibilities and opportunities in teaching within and across contexts. Teaching is context-specific and unique but also generic and transversal. Despite distinct features linked to unique students and classrooms, all teachers strive toward the same goal: supporting students' learning. Using classroom observations to systematically shed light on how this common goal is addressed and pursued across contexts, countries, and settings could help researchers and teachers to identify how different pedagogical approaches work across contexts and for which purposes. One of the legacies of the TIMSS video study has been the classroom videos demonstrating possible patterns and differences across contexts and countries. However, as noted earlier, Clarke and colleagues (Clarke et al., 2006a) have underscored that more variation occurs within teaching practices in a specific country than between countries. In the current LISA Nordic study (Klette et al., 2017), we use carefully sampled classroom videos from all Nordic countries for the purpose of pre- and in-service learning. We use these exemplary videos from Nordic classrooms to inform discussions about context-specific and/or generic teaching practices across the Nordic countries. Likewise, exemplary classroom videos could serve as approximations to practice in teacher education. Reviews of the use of videos in teacher education have concluded that such use has achieved promising results (Calandra & Rich, 2015; Sherin & Russ, 2014) but requires further investigation of a number of issues, such as how these programs affect student learning and produce sustainable pedagogical practices (Cochran Smith et al., 2015). An additional avenue for investigation is how such programs of pre- and in-service learning take into account differences in purpose, setting, and target groups (Gaudin & Chaliès, 2015). Researchers have, for example, identified important differences between designs that either ask student teachers to reflect on episodes from their own classrooms or have student teachers access scripted versions highlighting consensually validated exemplars (Borko et al., 2010). In addition, student teachers' participation in the video recording and editing may be important for their uptake of feedback (Tripp & Rich, 2012).
The analytical focus has varied from studies intended to stimulate reflection (Rich & Hannafin, 2009) to those that emphasize noticing (Van Es et al., 2014) and/or shaping core
teaching practices (McDonald et al., 2013). In the interest of further progress in using videos in pre- and in-service learning, four major concerns might be pursued. First, very few studies have investigated how video-based teacher education and professional development affect classroom practices and student achievement. Second, evidence is needed of sustainable and generalizable effects of video-based teacher education and learning. Third, there is a need for large-scale comparative studies of the differential outcomes of videos as a means for professional learning in both pre-service and in-service teaching. Finally, methodological improvements are crucial in terms of validating and scaling up measurement tools in teacher learning and teacher education.

Concluding Remarks

As argued, theoretical and methodological advancements in measuring classroom teaching and instruction, together with innovation in video technology, have paved the way for a new generation of classroom studies, strengthening international large-scale research designs and improvements in teacher learning and collaboration. Researchers have made significant advancements in conceptualizing, operationalizing, and measuring teaching quality, and scholars agree that measures of teaching effectiveness derived from classroom observation and student achievement data provide useful predictive information about future student outcomes. However, research has also suggested that measures of teaching quality may suffer from threats of distortion and risks such as rater error, lack of common goals and purposes, differences in terminology and conceptual language (e.g., items and constructs), difficulties in operationalization (e.g., teaching dimensions captured, grain size), and measurement issues (e.g., convergent and divergent validity). While I am optimistic about the potential of analyzing aspects of teaching quality through the means of video capturing and standardized observation manuals, several important questions remain to be answered. For future research in this area, the following avenues will be especially critical for further development: how to differentiate, conceptualize, measure, and foster teaching quality through the lens of observation while remaining sensitive to contextual and local factors; how to decompose teaching into elements that count, are parsimonious (easy to understand and use), and have proven fruitful for the intended purpose; and, last but not least, how to cultivate reliable measures targeted to achieve such an ambition.

References

Barrett, N., Crittenden-Fueller, S., & Guthrie, J. E. (2015). Subjective ratings of teachers: Implications for strategic and high-stakes decisions [Conference presentation]. Association of Educational Finance and Policy Annual Meeting, Washington, DC. Bartlett, L., & Vavrus, F. (2017). Rethinking case study research: A comparative approach. Routledge.
Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., Klusmann, U., Krauss, S., Neubrand, M., & Tsai, Y.-M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47(1), 133–180. https://doi.org/10.3102/0002831209345157 Baumert, J., & Kunter, M. (2013). The COACTIV model of teachers’ professional competence. In M. Kunter, J. Baumert, W. Blum, U. Klusmann, S. Krauss, & M. Neubrand (Eds.), Cognitive activation in the mathematics classroom and professional competence of teachers: Results from the COACTIV project (Vol. 8, pp. 25–48). Springer. Beisiegel, M., Mitchell, R., & Hill, H. C. (2018). The design of video-based professional development: An exploratory experiment intended to identify effective features. Journal of Teacher Education, 69(1), 69–89. Bell, C.A. (2021a). The development of the study observation coding system. In OECD (Ed.) Global teaching insights technical report (Ch. 4). OECD. http://www.oecd.org/education/ school/GTI-TechReport-Chapter4.pdf Bell, C.A. (2021b). Rating teaching components and indicators of video observations. In OECD (Ed.) Global teaching insights technical report (Ch. 6). OECD. http://www.oecd.org/education/ school/GTI-TechReport-Chapter6.pdf Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., Pianta, R. C., & Qi, Y. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2–3), 62–87. https://doi.org/10.1080/10627197.2012.715014 Bell, C. A., Qi, Y., Croft, A., Leusner, D. M., Gitomer, D., McCaffrey, D., & Pianta, R. (2014). Improving observational score quality: Challenges in observer thinking. In T. J. Kane, R. Kerr, & R. Pianta (Eds.), Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching project (pp. 50–97). Jossey-Bass. Bell, C. A., Dobbelaer, M. J., Klette, K., & Visscher, A. (2019). Qualities of classroom observation systems. School Effectiveness and School Improvement, 30(1), 3–29. https://doi.org/10.1080/ 09243453.2018.1539014 Berge, M., & Ingerman, A. (2017). Multiple theoretical lenses as an analytical strategy in researching group discussions. Research in Science & Technological Education, 35(1), 42–57. https://doi.org/10.1080/02635143.2016.1245657 Berlin, R., & Cohen, J. (2018). Understanding instructional quality through a relational lens. ZDM, 50(3), 367–379. https://doi.org/10.1007/s11858-018-0940-6 Berlin, R., & Cohen, J. (2020). The convergence of emotionally supportive learning environments and College and Career Ready Mathematical Engagement in Upper Elementary Classrooms. AERA Open, 6(3). https://doi.org/10.1177/2332858420957612 Bill & Melinda Gates Foundation (BMGF) (2012). Gathering feedback for teaching: Combining high quality observations with student surveys and achievement gains. Author. http://eric.ed. gov/?id¼ED540960 Blömeke, S., Gustafsson, J.-E., & Shavelson, R. J. (2015). Beyond dichotomies: Competence viewed as a continuum. Zeitschrift für Psychologie, 223(1), 3–13. https://doi.org/10.1027/ 2151-2604/a000194 Blömeke, S., Busse, A., Kaiser, G., König, J., & Suhl, U. (2016). The relation between contentspecific and general teacher knowledge and skills. Teaching and Teacher Education, 56, 35–46. https://doi.org/10.1016/j.tate.2016.02.003 Borko, H., Jacobs, J., & Koellner, K. (2010). Contemporary approaches to teacher professional development. In: Penelope Peterson, Eva Baker, Barry McGaw, (Editors), International encyclopedia of education (pp. 
548–556). https://doi.org/10.1016/B978-0-08-044894-7.00654-0. Bostic, J., Lesseig, K., Sherman, M., & Boston, M. (2019). Classroom observation and mathematics education research. Journal of Mathematics Teacher Education. https://doi.org/10.1007/s10857019-09445-0 Boston, M. D., & Candela A. G. (2018). The Instructional Quality Assessment as a tool for reflecting on instructional practice. ZDM: The International Journal of Mathematics Education, 50:427–444. https://doi.org/10.1007/s11858-018-0916-6
Brophy, J. E., & Good, T. L. (1986). Teacher behavior and student achievement. In M. C. Wittrock (Ed.), Handbook of research on teaching (Vol. 3, pp. 328–375). Casabianca, J. M., McCaffrey, D. F., Gitomer DH, Bell CA, Hamre BK. (2013). Effect of observation mode on measures of secondary mathematics teaching. Educational and Psychological Measurement, 73(5), 757–783. Casabianca, J. M., Lockwood, J. R., & McCaffrey, D. F. (2015). Trends in classroom observation scores. Educational and Psychological Measurement, 75(2), 311–337. https://doi.org/10.1177/ 0013164414539163 Calandra, B., & Rich, P. J. (2015). Ed. Routledge. Charalambos C. Y. & Praetorius A.K. (2020). Creating a forum for researching teaching and its quality more synergistically. Studies in Educational Evaluation, Vol 67. (On line version). Chevallard, Y. (1992). Fundamental Concepts in didactics: perspectives provided by an anthropological approach. Research in Didactique of Mathematics. Selected papers (pp. 131–167). Paris & Grenoble: ADIREM & La Pensée Sauvage. Clarke, D.J., Emanuelsson, J., Jablonka, E., & Mok, I. (2006a). Making connections: Comparing mathematics classrooms around the world (Vol. 2). Sense Publishers. Clarke, D.J., Keitel, C., & Shimizu, Y. (2006b). Mathematics classrooms in twelve countries: The insider’s perspective (Vol. 1). Sense Publishers. Clarke, D.J., Wang, L. Xu, L., Aizikovitsh-Udi E & Cao, Y. (2012). International comparisons of mathematics classrooms and curricula: The Validity-Comparability Compromise. In T. Y. Tso (Ed.) Proceedings of the 36th conference of the international group for the psychology of mathematics education (PME 36) (Vol. 2, pp. 171–178). Taipeu, Taiwan, July 18-to 22. Cochran Smith, M., Villegas, A. M., Abrahams, L., Chaveez_Moreno, L, Mills, T. Y., & Stem, R. (2015). Critiquing teacher preparation: On overview of the Field Part II. Journal of Techer Education, 66(2), 109–121. https://doi.org/10.1177/0022487114558268 Coe, R., Aloisi, C., Higgins, S., & Major, L. E. (2014). What makes great teaching? Review of the underpinning research. http://www.suttontrust.com/researcharchive/great-teaching/ Cohen, J., & Goldhaber, D. (2016). Building a more complete understanding of teacher evaluation using classroom observations. Educational Researcher, 45(6), 378–387. https://doi.org/10. 3102/0013189x16659442 Cohen, J., & Grossman, P. (2016). Respecting complexity in measures of teaching: Keeping students and schools in focus. Teaching and Teacher Education, 55, 308–317. https://doi.org/ 10.1016/j.tate.2016.01.017 Cor, M. K. (2011). Investigating the reliability of classroom observation protocols: The case of PLATO. Creswell, J.,Schwantner, U. & Waters, C. (2016). A review of international large-scale assessments in education. Assessing component skills and collecting contextual data. PISA, The World Bank, OECD Publishing. https://doi.org/10.1787/9789264248373-en. Danielson Group. (2013). The framework for teaching evaluation instrument. Author. https:// danielsongroup.org/products/product/framework-teaching-evaluation-instrument Decristan, J., Klieme, E., Kunter, M., Hochweber, J., Büttner, G., Fauth, B., Hondrich, A. L., Rieser, S., Hertel, S., & Hardy, I. (2015). Embedded formative assessment and classroom process quality: How do they interact in promoting science understanding? American Educational Research Journal, 52(6), 1133–1159. https://doi.org/10.3102/0002831215596412 Derry, S. J., Pea, R. D., Barron, B., Engle, R. A., Erickson, F., Goldman, R., & Sherin, B. L. (2010). 
Conducting video research in the learning sciences: Guidance on selection, analysis, technology, and ethics. Journal of the Learning Sciences, 19(1), 3–53. https://doi.org/10.1080/ 10508400903452884 Dobbelaer, M. (2019). The quality and qualities of school observation systems [Unpublished doctoral dissertation]. University of Twente, Netherlands. Fischer, H., & Neumann, K. (2012). Video analysis as a tool for understanding science instruction. In D. Jorde & J. Dillon (Eds.), Science education research and practice in Europe: Retrospective and prospective (pp. 115–139). Sense Publishers.

504

K. Klette

Fischer, H., Labudde, P., Neumann, K., & Viiri, J. (2014). Quality of instruction in physicscomparing Finland, Germany and Switzerland. Waxmann. Fischer, J., Praetorius, A.-K., & Klieme, E. (2019). The impact of linguistic similarity on crosscultural comparability of students’ perceptions of teaching quality. Educational Assessment, Evaluation and Accountability, 31(2), 201–220. https://doi.org/10.1007/s11092-019-09295-7 Flanders, N. A. (1970). Analyzing teaching behavior. Addison-Wesley. Gaudin, C., & Chaliès, S. (2015). Video viewing in teacher education and professional development: A literature review. Educational Research Review, 16(16), 41–67. https://doi.org/10.1016/ j.edurev.2015.06.001 Gehlbach, H., & Brinkworth, M. (2011). Measure twice, cut down error: A process for enhancing the validity of survey scales. Review of General Psychology, 15(4), 380–387. https://doi.org/10. 1037/a0025704 Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. REL 2017-191. National Center for Education Evaluation and Regional Assistance Gitomer, D. H. (2009). Measurement issues and assessment of teaching quality. Sage. Gitomer, D., Bell, C., Qi, Y., McCaffrey, D., Hamre, B. K., & Pianta, R. C. (2014). The instructional challenge in improving teaching quality: Lessons from a classroom observation protocol. Teachers College Record, 116(6), 1–20. Gitomer, D. H., Martínez, J. F., Battey, D., & Hyland, N. E. (2019). Assessing the assessment: Evidence of reliability and validity in the edTPA. American Educational Research Journal. https://doi.org/10.3102/0002831219890608 Givvin, K., Hiebert, J., Jacobs, J., Hollingsworth, H., & Gallimore, R. (2005). Are there national patterns of teaching? Evidence from the TIMSS 1999 Video Study. Comparative Education Review, 49(3), 311–343. https://doi.org/10.1086/430260 Goldman, R., Pea, R., Barron, B., & Derry, S. J. (2007). Video research in the learning sciences. Lawrence Erlbaum. Griffin, G., & Leibetseder, D. (2019). “Only applies to research conducted in Sweden”: Dilemmas in gaining ethics approval in transnational qualitative research. International Journal of Qualitative Methods, 18, 1–10. https://doi.org/10.1177/1609406919869444 Grissom, J.A. & Loeb, S, (2014). Assessing principals’ assessments: Subjective evaluations of teacher effectiveness in low- and high-stakes environment. Paper presented at Association for Education Finance and Policy annual meeting, San Antonio, TX. Grossman, P. (2015). Protocol for language arts teaching observations (PLATO 5.0). Stanford University. https://cset.stanford.edu/research/project/protocol-language-arts-teaching-observations-plato Grossman, P., & McDonald, M. (2008). Back to the future: Directions for research in teaching and teacher education. American Educational Research Journal, 45(1), 184–205. https://doi.org/10. 3102/0002831207312906 Grossman, P., Loeb, S., Cohen, J., & Wyckoff, J. (2013). Measure for measure: The relationship between measures of instructional practice in middle school English language arts and teachers’ value-added scores. American Journal of Education, 119(3), 445–470. https://doi.org/10.1086/ 669901 Hammersley, M. (2012). Troubling theory in case study research. Higher Education Research & Development: Questioning Theory-Method Relations in Higher Education Research, 31(3), 393–405. https://doi.org/10.1080/07294360.2011.631517 Hamre, B. K., Pianta, R. C., Downer, J. T., DeCoster, J., Mashburn, A. 
J., Jones, S. M., Brown, J. L., Cappella, E., Atkins, M., Rivers, S. E., Brackett, M. A., & Hamagami, A. (2013). Teaching through interactions: Testing a developmental framework of teacher effectiveness in over 4,000 classrooms. The Elementary School Journal, 113, 461–487. https://doi.org/10. 1086/669616 Hiebert, J., Gallimore, R., Garnier, H., Givvin, K. B., Hollingsworth, H., Jacobs, J., Chui, A. M.-Y., Wearne, D., Smith, M., Kersting, N., Manaster, A., Tseng, E., Etterbeck, W., Manaster, C., Gonzales, P., & Stigler, J. (2003). Teaching mathematics in seven countries: Results from the TIMSS 1999 Video Study. Education Statistics Quarterly, 5(1), 7–15.

18

The Use of Video Capturing in International Large-Scale Assessment. . .

505

Hill, H. C., Blunk, M., Charalambous, C., Lewis, J., Phelps, G. C., Sleep, L., et al. (2008). Mathematical knowledge for teaching and the mathematical quality of instruction: An exploratory study. Cognition and Instruction, 26(4), 430–511. Hill, H., Charalambous, C. Y., Blazar, D., McGinn, D., Kraft, M. A., Beisiegel, M., Humez, A., Litke, E., & Lynch, K. (2012a). Validating instruments for observation instruments: Attending to multiple sources of variation. Educational Assessment, 17(2–3), 88–106. https://doi.org/10. 1080/10627197.2012.715019 Hill, H., Charalambous, C. Y., & Kraft, M. A. (2012b). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64. https://doi.org/10.3102/0013189X12437203 Hill, H., Beisiegel, M., & Jacob, R. (2013). Professional development research: Consensus, crossroads, and challenges. Educational Researcher, 42(9), 476–487. https://doi.org/10.3102/ 0013189X13512674 Hill, H., & Grossman, P. (2013). Learning from teacher observations: Challenges and opportunities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384. https:// doi.org/10.17763/haer.83.2.d11511403715u376 Humphry, S. M., & Heldsinger, S. A. (2014). Common structural design features of rubrics may represent a threat to validity. Educational Researcher, 43(5), 253–263. https://doi.org/10.3102/ 0013189X14542154 Janík, T., & Seidel, T. (2009). The power of video studies in investigating teaching and learning in the classroom. Waxmann. Joe, J., Kosa, J., Tierney, J., & Tocci, C. (2013). Observer calibration. Teachscape. Kane, M. (2006). Validation. Educational Measurement, 4, 17–64. Kane, T. J., Staiger, D. O., McCaffrey, D., Cantrell, S., Archer, J., Buhayar, S., & Parker, D. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Bill & Melinda Gates Foundation, Measures of Effective Teaching Project. https://files.eric.ed.gov/fulltext/ED540960.pdf Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment: Bill & Melinda Gates Foundation, Measures of Effective Teaching Project. https://files.eric.ed.gov/fulltext/ ED540959.pdf Kelly, S., Bringe, R., Aucejo, E., & Fruehwirth, J. (2020). Using global observation protocols to inform research on teaching effectiveness and school improvement: Strengths and emerging limitations. Education Policy Analysis Archives, 28(62). Kersting, N. B., Givvin, K. B., Sotelo, F. L., & Stigler, J. W. (2010). Teachers’ analyses of classroom video predict student learning of mathematics: Further explorations of a novel measure of teacher knowledge. Journal of Teacher Education, 61(1–2), 172–181. https://doi. org/10.1177/0022487109347875 Klette K. (2007): Trends in Research on teaching and Learning in Schools: Didactics meets Classroom studies. European Educational Research Journal, 6 (2), 147-161. https://doi.org/ 10.2304/eerj.2007.6.2.147 Klette, K. (2009). Challenges in strategies for complexity reduction in video studies. Experiences from the PISA+ study: A video study of teaching and learning in Norway. In T. Janik & T. Seidel (Eds.), The power of video studies in investigating teaching and learning in the classroom (pp. 61–83). Waxmann Publishing. Klette, K. (2015). Introduction: Studying interaction and instructional patterns in classrooms. In K. Klette, O. K. Bergem, & A. 
Roe (Eds.), Teaching and learning in lower secondary schools in the era of PISA and TIMSS (pp. 1–16). Springer International Publishing. Klette, K. (2019). Ethical by design: secure, accessible and shareable video data.[Conference presentation]..European Educational Research Association (EERA) Annual Conference (ECER), Hamburg, September 2–6. Klette, K. (2020). Hva vet vi om god Undervisning [Summarizing researh on teaching quality: What do we know?] In: R. Krumsvik, & R. Säljô (Eds.). Praktisk Pedagogisk Utdanning [Practical Teacher Training]. Fagbokforlaget-

506

K. Klette

Klette, K. (2022/ accepted). Coding manuals as way of strengthening programmatic research in classroom studies. In Ligozat et al. (Eds.) Didactics in a changing world. European perspectives on learning, teaching and curriculum. Springer Publishing Klette, K., Blikstad-Balas, M., & Roe, A. (2017). Linking instruction and student achievement: Research design for a new generation of classroom studies. Acta Didactica Norge, 11(3), 1–19. https://doi.org/10.5617/adno.4729 Klette, K., & Blikstad-Balas, M. (2018). Observation manuals as lenses to classroom teaching: Pitfalls and possibilities. European Educational Research Journal, 17(1), 129–146. https://doi. org/10.1177/1474904117703228 Klette, K., Roe, A., & Blikstad-Balas, M. (2021). Observational scores as predictors for student achievement gains. In K. Klette, M. Tengberg, & M. Blikstad-Balas (Eds.), Ways of measuring teaching quality: Possibilities and pitfalls. Oslo University Press. Klieme, E., Pauli, C., & Reusser, K. (2009). The Pythagoras study: Investigating effects of teaching and learning in Swiss and German mathematics classrooms. In T. Janik & T. Seidel (Eds.), The power of video studies in investigating teaching and learning in the classroom (pp. 137–160). Waxmann. Knoblauch, H., & Schnettler, B. (2012). Videography: Analysing video data as a “focused” ethnographic and hermeneutical exercise. Qualitative Research, 12(3), 334–356. https://doi. org/10.1177/1468794111436147 Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46, 234–249. https://doi.org/ 10.3102/0013189X17718797 Krosnick, J. A., & Presser, S. (2010). Question and questionnaire design. In J. D. Wright & P. V. Marsden (Eds.), Handbook of survey research (Vol. 2, pp. 263–313). Emerald Group. Kunter, M., Baumert, J., & Köller, O. (2007). Effective classroom management and the development of subject-related interest. Learning and Instruction, 17(5), 494–509. https://doi.org/10. 1016/j.learninstruc.2007.09.002 Kyriakides, L., Creemers, B. P. M., & Panayiotou, A. (2018). Using educational effectiveness research to promote quality of teaching: The contribution of the Dynamic model. ZDM: The International Journal on Mathematics Education, 50(3), 381–393. Lahn L. C. & Klette, K (2022). Reactivity beyond contamination? An integrative literature review of video studies in educational research. International Journal of Research and Methods in Education [Manuscript accepted for publication] Leung, F. K. S. (1995). The mathematics classroom in Beijing, Hong Kong and London. Educational Studies in Mathematics, 29(4), 297–325. https://doi.org/10.1007/BF01273909 Lindorff, A., & Sammons, P. (2018). Going beyond structured observations: Looking at classroom practice through a mixed method lens. ZDM: The International Journal on Mathematics Education, 50(3), 521–534. Lipowsky, F., Rakoczy, K., Pauli, C., Drollinger-Vetter, B., Klieme, E., & Reusser, K. (2009). Quality of geometry instruction and its short-term impact on students’ understanding of the Pythagorean Theorem. Learning and Instruction, 19(6), 527–537. https://doi.org/10.1016/j. learninstruc.2008.11.001 Liu, S., Bell, C. A., Jones, N. D., & McCaffrey, D. F. (2019). Classroom observation systems in context: A case for the validation of observation systems. Educational Assessment, Evaluation, and Accountability, 31, 61–95. https://doi.org/10.1007/s11092-018-09291-3 Luoto, J. M. (2021). 
Exploring, understanding, and problematizing patterns of instructional quality: A study of instructional quality in Finnish–Swedish and Norwegian lower secondary mathematics classrooms. PhD thesis, University of Oslo, 2021. Luoto, J. M., Klette, K., & Blikstad-Balas. (2022). Possible biases in observation systems when applied across contexts: Conceptualizing, operationalizing and sequencing instructional quality. [Manuscript accepted for publication] Martin, C., Radisic, J., Stovner R. B., Klette, K., & Blikstad-Balas M. (2021). Exploring the use of mathematics observation tools across the contexts of the United States, Norway, and Finland: How can observation instruments shape our understanding of instructional quality when applied across contexts? [Manuscript submitted for publication] in Educational Assessment, Evaluation and Accountability

18

The Use of Video Capturing in International Large-Scale Assessment. . .

507

Martinez, F., Taut, S., & Schaaf, K. (2016). Classroom observation for evaluating and improving teaching: An international perspective. Studies in Educational Evaluation, 49, 15–29. https:// doi.org/10.1016/j.stueduc.2016.03.002 McClellan, C., Atkinson, M., & Danielson, C. (2012). Teacher evaluator training & certification: Lessons learned from the measures of effective teaching project. Practitioner Series for Teacher Evaluation. Teachscape. San Francisco, CA. McDonald, M., Kazemi, E., & Kavanagh, S. S. (2013). Core practices and pedagogies of teacher education: A call for a common language and collective activity. Journal of Teacher Education, 64(5), 376–386. https://doi.org/10.1177/0022487113493807 Mikeska, J. N., Holtzman, S., McCaffrey, D. F., Liu, S., & Shattuck, T. (2018). Using classroom observations to evaluate science teaching: Implications of lesson sampling for measuring science teaching effectiveness across lesson types. Science Education, 103(1), 123–144. https://doi.org/10.1002/sce.21482 Meyer, M. A. (2012). Didactics in Europe. Zeitschrift für Erziehungswissenschaft, 15, 449–482. https://doi.org/10.1007/s11618-012-0322-8 Muijs D, Reynolds D, Sammons P, Kyriakides L, Creemers BPM, & Teddlie C. (2018). Assessing individual lessons using a generic teacher observation instrument: how useful is the International System for Teacher Observation and Feedback (ISTOF)? ZDM: The International Journal on Mathematics Education, 50(3), 395–406. https://doi.org/10.1007/s11858-0180921-9 Newton, X. A. (2010). Developing indicators of classroom practice to evaluate the impact of district mathematics reform initiative: A generalizability analysis. Studies in Educational Evaluation, 36(1), 1–13. https://doi.org/10.1016/j.stueduc.2010.10.002 Nilsen, T., & Gustafsson, J.-E. (2016). Teacher quality, instructional quality and student outcomes: Relationships across countries, cohorts and time (Vol. 2). Springer International Publishing. Ødegaard, M., & Klette, K. (2012). Teaching activities and language use in science classrooms: Categories and levels of analysis as tools for interpretation. In D. Jorde & J. Dillon (Eds.), Science education research and practice in Europe (pp. 181–202). Sense Publishers. OECD (Ed.) (2020). Global teaching insights: A video study of teaching. https://doi.org/10.1787/ 20d6f36b-en Office for Standards in Education, Children’s Services and Skills (Ofsted). (2018). Six models of lesson observation: An international perspective. https://assets.publishing.service.gov.uk/government/ uploads/system/uploads/attachment_data/file/708815/Six_models_of_lesson_observation.pdf Opfer, D., Bell, C., Klieme, E., Mccaffrey, D., Schweig, J., & Stetcher, B. (2020). Chapter 2 Understanding and measuring mathematics teaching practice. In OECD: Global teaching insights. A video study of teaching (pp. 33–47). https://doi.org/10.1787/98e0105a-en Oser, F. K., & Baeriswyl, F. J. (2001). Choreographies of teaching: Bridging instruction to learning. In V. Richardson (Ed.), Handbook of research on teaching (Vol. 4, pp. 1031–1065). American Educational Research Association. Pianta, R., & Hamre, B. (2009). Conceptualization, measurement, and improvement of classroom processes: Standardized observation can leverage capacity. Educational Researcher, 38(2), 109–119. https://doi.org/10.3102/0013189x09332374 Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom assessment scoring system: Forms, pre-K-3. Paul H. Brookes Publishing. 
Praetorius, A.-K., Pauli, C., Reusser, K., Rakoczy, K., & Klieme, E. (2014). One lesson is all you need? Stability of instructional quality across lessons. Learning and Instruction, 31, 2–12. https://doi.org/10.1016/j.learninstruc.2013.12.002 Praetorius, A.-K., & Charalambous, C. Y. (2018). Classroom observation frameworks for studying instructional quality: Looking back and looking forward. ZDM: The International Journal on Mathematics Education, 50(3), 535–553. https://doi.org/10.1007/s11858-018-0946-0 Praetorius, A.-K., Klieme, E., Herbert, B., & Pinger, P. (2018). Generic dimensions of teaching quality: The German framework of Three Basic Dimensions. ZDM: The International Journal on Mathematics Education, 50(3), 407–426. https://doi.org/10.1007/s11858-018-0918-4 Praetorius, A.-K., Rogh, W., Bell, C., & Klieme, E. (2019). Methodological challenges in conducting international research on teaching quality using standardized observations. In

508

K. Klette

L. Suter, E. Smith, & B. D. Denman (Eds.), The SAGE handbook of comparative studies in education (pp. 269–288). SAGE. Praetorius, A. K., Grunkorn, J., & Klieme, E. (2020). Towards developing a theory of generic teaching quality: Origin, current status, and necessary next steps regarding the three basic dimensions model. Zeitschrift für Pädagogik Beiheft, 1, 15–36. Raudenbush, S. W. (2008). Advancing educational policy by advancing research on instruction. American Educational Research Journal, 25, 206–230. Raudenbush, S. W., & Jean, M. (2015). To what extent do student perceptions of classroom quality predict teacher value added? In T. J. Kane, K. A. Kerr, & R. C. Pianta (Eds.), Designing teacher evaluation systems (pp. 170–202). Jossey Bass. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). see https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri¼CELEX:32016R0679 Rowan, B., & Raudenbush, S. W. (2016). Teacher evaluation in American schools. In D. H. Gitomer & C. A. Bell (Eds.), Handbook of research on teaching (5th ed., pp. 1159–1217). American Educational Research Association. Roth, K. J. (2006). Teaching Science in Five Countries: Results from the TIMSS 1999 Video Study: Statistical Analysis Report. US Department of Education, National Center for Education Statistics. Rich, P. J., & Hannafin, M. (2009). Video annotation tools: Technologies to scaffold, structure, and transform teacher reflection. Journal of Techer Education, 60(1), 52–67. https://doi.org/10. 1177/0022487108328486 Santagata, R., Kersting, N., Givvin, K. B., & Stigler, J. W. (2010). Problem implementation as a lever for change: An experimental study of the effects of a professional development program on students’ mathematics learning. Journal of Research on Educational Effectiveness, 4(1), 1–24. https://doi.org/10.1080/19345747.2010.498562 Santelicesa, M. V., & Taut, S. (2011). Convergent validity evidence regarding the validity of the Chilean standards-based teacher evaluation system. Assessment in Education: Principles, Policy & Practice, 18(1), 73–93. Scheerens, J. (2014). School, teaching, and system effectiveness: Some comments on three state-ofthe-art reviews. School Effectiveness and School Improvement, 25(2), 282–290. https://doi.org/ 10.1080/09243453.2014.885453 Schleicher, A. (2011). Lessons from the world on effective teaching and learning environments. Journal of Teacher Education, 62(2), 202–221. https://doi.org/10.1177/0022487110386966 Schlesinger, L., & Jentsch, A. (2016). Theoretical and methodological challenges in measuring instructional quality in mathematics education using classroom observations. ZDM: The International Journal on Mathematics Education, 48(1-2), 29–40. https://doi.org/10.1007/s11858-016-0765-0 Schlesinger, L., Jentsch, A., & Kaiser, G., et al (2018). Subject-specific characteristics of instructional quality in mathematics education. ZDM: The International Journal of Mathematics Education, 50, 475–490. https://doi.org/10.1007/s11858-018-0917-5 Schoenfeld, A. (2014). What makes for powerful classrooms, and how can we support teachers in creating them? A story of research and practice, productively intertwined. Educational Researcher, 43(8), 404–412. https://doi.org/10.3102/0013189X14554450 Schoenfeld, A. 
H., Floden, R., El Chidiac, F., Gillingham, D., Fink, H., Hu, S., & Zarkh, A. (2018). On classroom observations. Journal for STEM Education Research, 1(1–2), 34–59. https://doi. org/10.1007/s41979-018-0001-7 Schweisfurth, M. (2019). Qualitative comparative education research: Perennial issues, new approaches and good practice. In L. E. Suter, E. Smith, & B. D. Denman (Eds.), The SAGE handbook of comparative studies in education (pp. 258–268). SAGE. Schultz, S. E., & Pecheone, R. L. (2015). Assessing quality teaching in science. In T. J. Kane, A. K. A. Kerr, & R. C. Pienta (Eds.), Designing teacher evaluation systems: New guidance from the measures of effective teaching project (pp. 444–492). Wiley. Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis results. Review of Educational Research, 77(4), 454–499. https://doi.org/10.3102/0034654307310317

18

The Use of Video Capturing in International Large-Scale Assessment. . .

509

Seidel, T., & Prenzel, M. (2006). Stability of teaching patterns in physics instruction: Findings from a video study. Learning and Instruction, 16(3), 228–240. https://doi.org/10.1016/j.learninstruc. 2006.03.002 Sensevy, G. (2011). Overcoming fragmentation: Towards a joint action theory in didactics. In Hudson, B. & Meyer, M. A. (Eds.) Beyond fragmentation: Didactics, learning, and teaching, pp. 60-76. Verlag Barbara Budrich, Opladen and Farmington Hills. Sensevy, G., & Mercier, A. (Eds.). (2007). Agir ensemble: L’action didactique conjointe du professeur et des élèves. Presses Universitaires de Rennes. Sherin, M. G., & Russ, R. S. (2014). Teacher noticing via video: The role of interpretive frames. In Digital video for teacher education (pp. 11–28). Routledge. Snell, J. (2011). Interrogating video data: Systematic quantitative analysis versus microethnographic analysis. International Journal of Social Research Methodology, 14(3), 253–258. https://doi.org/10.1080/13645579.2011.563624 Standards for Educational and Psychological Testing. (1999). (AERA, APA & NCME) Stigler, J. W., & Hiebert, J. (1997). Understanding and improving classroom mathematics instruction. Phi Delta Kappa, (1997September), 14–21. Stigler, J. W., & Hiebert, J. (1999). The teaching gap: Best ideas from the world’s teachers for improving education in the classroom. Free Press. Stigler, J. W., & Miller, K. F. (2018). Expertise and expert performance in teaching. In A. M. Williams, A. Kozbelt, K. A. Ericsson, & R. R. Hoffman (Eds.), The Cambridge handbook of expertise and expert performance (2nd ed., pp. 431–452). Cambridge University Press. https:// doi.org/10.1017/9781316480748.024 Stuhlman, M. W., Hamre, B. K., Downer, J. T., & Pianta, R. C. (2010). Why should we use classroom observation. Teachstone. Taut, S., Cortés, F., Sebastian, C., & Preiss, D. (2009). Evaluating school and parent reports of the national student achievement testing system (SIMCE) in Chile: Access, comprehension, and use. Evaluation and Program Planning, 32(2), 129–137. https://doi.org/10.1016/j.evalprogplan.2008. 10.004 Taut, S., & Rakoczy, K. (2016). Observing instructional quality in the context of school evaluation. Learning and Instruction, 46, 45–60. https://doi.org/10.1016/j.learninstruc.2016.08.003 Taut, S., Santelices, M. V., Araya, C., & Manzi, J. (2011). Perceived effects and uses of the national teacher evaluation system in Chilean elementary schools. Studies in Educational Evaluation, 37, 218–229. Taut, S., Santelices, M. V., & Stecher, B. (2012). Teacher assessment and improvement system. Educational Assessment, 17(4), 163–199. https://doi.org/10.1080/10627197.2012.735913 Taut, S., & Sun, Y. (2014). The development and implementation of a national, standards based, multi-method teacher performance assessment system in Chile. Education Policy Analysis Archives, 22(71). https://doi.org/10.14507/epaa.v22n71.2014 Teddlie, C., Creemers, B., Kyriakides, L., Muijs, D., & Yu, F. (2006). The International System for Teacher Observation and Feedback: Evolution of an international study of teacher effectiveness constructs. Educational Research and Evaluation, 12(6), 561–582. https://doi.org/10.1080/ 13803610600874067 Thomas, G. (2007). Education and theory: Strangers in paradigms. Open University Press. Tremblay, K., & Pons, A. (2019). The OECD TALIS video study – Progress report. OECD. http:// www.oecd.org/education/school/TALIS_Video_Study_Progress_Report.pdf Tripp, T. R., & Rich, P. J. (2012). 
The influence of video analysis on the process of teacher change. Teaching and Teacher Education, 28(5), 728–739. https://doi.org/10.1016/j.tate.2012.01.011 Van de Grift, W. J. C. M. (2007). Quality of teaching in four European countries: A review of the literature and application of an assessment instrument. Educational Research, 49(2), 127–152. https://doi.org/10.1080/00131880701369651 Van Es, E. A. (2012). Examining the development of a teacher learning community: The case of a video club. Teaching and Teacher Education, 28(2), 182–192. Walkington, C., & Marder, M. (2018). Using the UTeach Observation Protocol (UTOP) to understand the quality of mathematics instruction. ZDM: The International Journal on Mathematics Education, 50 (3), 507–519

510

K. Klette

Walkowiak, T. A., Berry, R. Q., Pinter, H. H., & Jacobson, E. D. (2018). Utilizing the M-Scan to measure standards-based mathematics teaching practices: Affordances and limitations. ZDM: The International Journal on Mathematics Education, 50 (3), 461–474 White, M. C. (2018). Rater performance standards for classroom observation instruments. Educational Researcher, 47(8), 492–501. https://doi.org/10.3102/0013189X18785623 White, M. C (2021/accepted). A validity framework for the design and analysis of studies using standardized observation systems. In K. Klette, M. Blikstad-Balas, & M. Tengberg (Eds.), Ways of measuring teaching quality: Perspectives, potentials and pitfalls. Oslo University Press. White, M.C. & Klette, K (2021). Rater error in standardized observations of teaching: Challenges from Latently continuous dimensions. Paper presented at the Earli conference 2021, Gothenburg, August 23–27. White, M., & Ronfeldt, M. (2020). Monitoring rater quality in observational systems: Issues due to unreliable estimates of rater quality. [Manuscript submitted for publication] White, M., Maher, B., & Rowan, B. (2021). Common Core-related shifts in English language arts teaching from 2010 to 2018: A video study. [Manuscript submitted for publication] Yang, X., Kaiser, G., König, J., & Blömeke, S. (2019). Professional noticing of mathematics teachers: A comparative study between Germany and China. International Journal of Science and Mathematics Education, 17(5), 943–963. https://doi.org/10.1007/s10763-018-9907-x

19 Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings

Eckhard Klieme
DIPF | Leibniz Institute for Educational Research and Information, Frankfurt am Main, Germany

Contents
Introduction
Prelude: Covering 48 years in Mathematics Education
Comparing and Finding Common Grounds in Design and Conceptualization
  Country Coverage
  Sampling
  Domain of Assessment and Further Skills Covered
  Context, Input, Process, and Output-Related Constructs
Comparing, Linking, and Matching Operational Measures
  Differences in Instruments and Measures
  Linking Measures from Different Studies
  Searching for Universal Descriptors and Scales
Combining Data Across Assessments in Educational Research
  Integrating Diverging and Converging Findings Across Studies
  Matching Data
  Longitudinal Designs
Analyzing ILSA Data on the Country Level
  Comparing TIMSS and PISA Achievement Results on the Country Level
  Comparing Country Level Trends Based on TIMSS and PISA
Conclusions and Implications for Further Research
References

Abstract

Even after six decades of international student assessments, we only weakly understand how and why educational systems are changing in the long run. One reason is the diversity of studies differing in design, sampling, conceptualization (e.g., research constructs covered), and measures. Such variation can be found both between and within long-standing programs of student assessment like TIMSS and PISA. The chapter aims at showing similarities and differences between ILSAs, understanding what can and what cannot safely be compared and combined, with the goal of finding common grounds for future research. Throughout, the lower secondary samples from TIMSS and PISA will be used as major examples. Section "Comparing and Finding Common Grounds in Design and Conceptualization" compares the conceptual foundations and design across ILSAs, especially the selection and definition of constructs covered, including cognitive tests as well as questionnaire-based measures of student background, educational processes, and noncognitive outcomes. Section "Comparing, Linking, and Matching Operational Measures" examines when and how to match empirical measures and to establish common scales. Section "Combining Data Across Assessments in Educational Research" looks at approaches for integrating data from separate ILSAs into complex analyses, such as longitudinal analyses on the individual, class, or school level, and multilevel analyses combining information from multiple studies. Section "Analyzing ILSA Data on the Country Level" discusses the cross-validation of trend information from TIMSS and PISA. Using ILSA's "Big Data" without considering the details of conceptualization, measurement, and data structure may lead to erroneous findings and policy conclusions. Recently, there has been some convergence in design and methods across ILSA studies and programs. Yet, more systematic approaches to instrument development and study design are needed.

Keywords

Large-scale assessment · Study design · Trend analysis · PISA · TIMSS

Introduction

International Large-Scale Assessments (ILSAs) were introduced by the International Association for the Evaluation of Educational Achievement (IEA) in 1959 through its first "pilot" study covering 12 countries. As documented across this volume, many individual ILSAs have been implemented over the last six decades. Some assessment programs have been providing trend information on student achievement and other features of school systems over the last two decades. With seven waves administered until 2019 and 2018, respectively, the Trends in International Mathematics and Science Study (TIMSS) – managed by IEA with scientific leadership provided by Boston College – and the Programme for International Student Assessment (PISA) – managed by OECD with scientific leadership mainly from the Australian Council for Educational Research (ACER, 2000–2012) and Educational Testing Service (ETS, since 2015) – are by far the most powerful and most cited studies.

Still, after six decades of international student assessments, we only weakly understand how and why educational systems are changing in the long run. One reason is the diversity of studies differing in design, sampling, conceptualization (e.g., research constructs covered), and measures, as illustrated in the first section ("Prelude") of this chapter. The chapter aims at understanding similarities and differences between ILSAs, with the goal of finding common grounds for future research. Notably, TIMSS and PISA will allow for the cross-validation of trend information from two independent chains of studies. In order to understand what can safely be compared and combined, the main body of this chapter is organized into four sections:

• Section "Comparing and Finding Common Grounds in Design and Conceptualization" will compare the conceptual foundations and design across ILSAs, especially the selection and definition of constructs covered. These of course include the cognitive assessments, but also the "Context Assessment" (Kuger et al., 2016), addressing student background, educational processes, noncognitive outcomes, etc. through "background questionnaires." As most areas of assessment, and many methodological issues, are treated in detail in separate chapters of this volume, this section will be quite short.
• Section "Comparing, Linking, and Matching Operational Measures" will work out when and how to match empirical measures, that is, test scales and indices based on questionnaire data. Are measures from separate studies actually comparable when claiming to operationalize the same construct, such as "Mathematical Literacy," "Reading Comprehension," "Enjoyment of Science," or "Socioeconomic Status (SES)"? How can common scales be established? Is it possible to establish "universal" scales linking measures from a large variety of studies?
• Section "Combining Data Across Assessments in Educational Research" will look at approaches for combining data from separate ILSAs and integrating them into complex analyses, such as longitudinal analyses on the individual, class, or school level, or multilevel analyses combining information from multiple studies.
• Section "Analyzing ILSA Data on the Country Level" will finally ask: How can we apply integrated analyses to better understand findings on the country level, especially to establish trends?

Throughout, the lower secondary (grade 8) classroom samples from TIMSS and the samples of 15-year-old students from PISA will be used as major examples.

Prelude: Covering 48 years in Mathematics Education

Arguably, the most important accomplishment of ILSAs is the monitoring of trends in educational achievement in order to understand factors driving system growth (or decline), and to advise future policies accordingly. However, so far there is little theoretical work (for an exception, see Sun et al., 2007), and also little empirical analysis of long-term system-level change (see Kaplan & Jude, this volume). Yet, there is an exceptionally long chain of ILSA data relating to mathematics achievement. In 1964, the First International Mathematics Study (FIMS) was implemented by IEA as the first representative ILSA ever. Twelve countries participated (Medrich & Griffith, 1992). Three decades later, 11 of them participated in the Third International Mathematics and Science Study (TIMSS 1995), again mandated by IEA (Beaton et al., 1996). Another 17 years later, all of those participated in OECD's PISA 2012 assessment (OECD, 2013), the most recent PISA wave focusing on mathematics. Table 1 shows country means for mathematics achievement as measured in those studies.

Table 1 Mathematics achievement in 11 countries participating in FIMS 1964, TIMSS 1995, and PISA 2012

                 FIMS 1964                TIMSS 1995              PISA 2012
                 (13-year-old students)   (Grade 8 students)      (15-year-old students)
Country          Original   Rescaled^a    Original   Rescaled^a   Original   Rescaled^a
Israel           32.3       +0.61         522        −0.07        466        −0.36
Japan            32.2       +0.60         605        +0.84        536        +0.37
Belgium          30.4       +0.48         547        +0.20        515        +0.15
Germany          25.4       +0.12         509        −0.22        514        +0.14
England          23.8       +0.01         506        −0.25        495        −0.06
Scotland         22.3       −0.10         498        −0.34        498        −0.03
Netherlands      21.4       −0.16         541        +0.14        523        +0.24
France           21.0       −0.19         538        +0.10        495        −0.06
Australia        18.9       −0.34         530        +0.02        504        +0.04
United States    17.8       −0.42         500        −0.32        481        −0.20
Sweden           15.3       −0.60         519        −0.11        478        −0.24

^a Rescaled within each study, based on the mean of country means and the pooled variance, giving equal weights to countries

Is there anything we may learn from such a table beyond cross-sectional comparisons and rankings – for example, that Japan and Belgium consistently fell among the three top performers and the USA among the three lowest-performing countries from this list? All three studies used broad tests of mathematical achievement, administered to representative samples of lower secondary students. Yet, the exact definition of target populations differed (see the header row in Table 1). Also, the studies used scales that are incommensurable: tests were conceptualized differently, there was no overlap in items, and scaling methods differed. One way of aligning the information is to standardize measures within each study, based on the mean of country means and the pooled variance. Such rescaled values (see Table 1) may help understand the relative change in country performance, compared to this specific group of countries – assuming tests were measuring similar constructs and countries' relative performance is consistent across age cohorts 13–15. If those assumptions were deemed valid, we may conclude that between-country variation became smaller over time: In FIMS, rescaled country means ranged from −0.60 to +0.61 with a variance of 0.170, while the between-country variance was reduced to 0.113 in TIMSS 1995 and 0.048 in PISA 2012. Relative mathematical achievement developed positively across the three studies in the Netherlands and the USA, and negatively in Israel and Belgium.


France, Sweden, and Japan had a peak in 1995, while Germany, England, and Scotland dropped in 1995 and developed relatively positively thereafter. Some of these moves may be explained by changes in sampling or construct definition. For example, the German FIMS sample was restricted to a few West German states. Israel tested its Hebrew-speaking population only until 1995 and included the Arab population later (Zuzovsky, 2008), which explains the latter part of its "decline" in relative performance. As the PISA mathematics assessment includes far more application-oriented tasks than the IEA assessments (see the next section of this chapter), the "drop" between TIMSS and PISA for Japan and France – countries with a traditional focus on pure mathematics (Klieme & Baumert, 2001) – as well as the slight "increase" for Germany, the Netherlands, and the English-speaking systems may in part be due to the different foci of the assessments.

This juxtaposition of information from three different ILSAs illustrates how easy it might be to tell fascinating "stories," although deeper scrutiny questions the validity of such interpretations. Combining data across studies may be informative, for example, for validating findings or analyzing trends. However, before combining data, we need to compare concepts, designs (including sampling), and measures to find out whether and to what extent studies are in fact commensurable. If possible, we may come up with matching procedures (rescaling, linking) that allow data from one study to be mapped onto another.
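The within-study rescaling behind Table 1 is straightforward to reproduce. The following sketch is a minimal Python illustration for the FIMS column only (the other studies work analogously); the country means are those reported in Table 1, whereas the pooled standard deviation is a placeholder, because the actual figure would have to be computed from each study's student-level data.

import statistics

# Country means for FIMS 1964 as reported in Table 1.
fims_means = {
    "Israel": 32.3, "Japan": 32.2, "Belgium": 30.4, "Germany": 25.4,
    "England": 23.8, "Scotland": 22.3, "Netherlands": 21.4, "France": 21.0,
    "Australia": 18.9, "United States": 17.8, "Sweden": 15.3,
}
pooled_sd = 14.0  # placeholder; in practice derived from the pooled student-level variance

# Standardize: subtract the unweighted mean of country means, divide by the pooled SD.
grand_mean = sum(fims_means.values()) / len(fims_means)
rescaled = {c: (m - grand_mean) / pooled_sd for c, m in fims_means.items()}

for country, z in sorted(rescaled.items(), key=lambda kv: -kv[1]):
    print(f"{country:<15}{z:+.2f}")

# Between-country variance of the rescaled means (about 0.17 for FIMS, cf. the text).
print("between-country variance:", round(statistics.variance(rescaled.values()), 3))

Note that such rescaling only supports relative comparisons within this fixed group of countries; it does not link the underlying achievement scales.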

Comparing and Finding Common Grounds in Design and Conceptualization

Country Coverage

The most obvious difference between separate ILSAs is their respective country coverage. No two studies cover exactly the same set of countries. TIMSS (grade 8) and PISA both include education systems across all continents, yet there is limited overlap. IEA has a history of serving a diversity of countries, both by including them in its studies and by supporting national or regional assessment programs. When PISA started in 2000, it was based almost exclusively on affluent industrialized countries with OECD membership. In 2003, when both studies were virtually run in parallel, PISA covered all 30 OECD members and 11 "partner" countries/systems, while TIMSS grade 8 covered 13 OECD members and 35 other countries, with an overlap of 17. In 2015, when the studies were run in parallel again, PISA reported on 73 countries or education systems and TIMSS (grade 8) on 40, with an overlap of 30 countries or systems (Table 2). TIMSS was relatively strong in the Middle East and was the only program to include any sub-Saharan African country. In all other regions of the world, PISA had grown to include nearly every country that participated in TIMSS grade 8 (exception: Armenia) plus many more. PISA covered practically all of Europe and most of the Americas.

Table 2 Country participation in PISA 2015 and TIMSS 2015 by region

Region                     PISA 2015   TIMSS Grade 8 2015   Overlap
Africa                     2           3                    0
Asia                       12          9                    8
Australia and Oceania      2           2                    2
Europe                     39          11                   11
Middle East                6           12                   6
North America              3           2                    2
South and Middle America   9           1                    1
Overall                    73          40                   30

It should be noted that the growth in country coverage requires caution even when comparing across PISA waves. OECD is used to reporting the "OECD average" as a benchmark for each measure. However, as membership has grown to 37 (in 2020), this average does not have a stable geographical reference. Also, some statistical transformations used in computing the "Economic, social and cultural status (ESCS)" index for students are based on parameters calculated across OECD members (OECD, 2016b; Watermann et al., 2016; for how this affects the measurement of trends, see section "Comparing, Linking, and Matching Operational Measures").

Still, Africa, Latin America, South Asia, and Oceania are underrepresented in these international programs. Therefore, regional initiatives such as SACMEQ and PASEC in Africa and South-East Asia, or LLECE, a program sponsored by UNESCO, in Latin America, are important providers of assessments that are tailored to the needs and capabilities of those countries. At the same time, there may be good reasons why developing countries have lined up to join the PISA program, or at least to link national data to the PISA metric, for example, through OECD's "PISA for Development" program. OECD has become a very powerful global player in educational policy. Governments may perceive joining PISA as an opportunity to document and foster educational, social, and economic development in relation to global standards. Thus, the topic of this chapter, comparing across ILSAs, is not just a methodological challenge or a question researchers raise out of curiosity. Global political trends and power shifts are driving the quest for universal standards in general, and for linking different kinds of achievement measures in particular. These motives clearly stimulate the search for a universal scaling mechanism for reading items, which could serve as the tertium comparationis between different ILSAs (see below). If such a mechanism existed and worked properly, different programs as well as individual ILSAs and national assessments could be aligned, so that the problem of differences in country coverage would become obsolete.
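In practice, any cross-program comparison starts with the bookkeeping behind Table 2: identifying which systems took part in both studies, and in which regions. A minimal Python sketch of that step is given below; the participation lists and region assignments are illustrative stand-ins, not the official 2015 rosters.

from collections import Counter

# Hypothetical mini-rosters (ISO-style codes) and region assignments, for illustration only.
region = {"DEU": "Europe", "FRA": "Europe", "JPN": "Asia", "KOR": "Asia",
          "USA": "North America", "CAN": "North America",
          "ARE": "Middle East", "EGY": "Middle East", "ZAF": "Africa"}
pisa_2015 = {"DEU", "FRA", "JPN", "KOR", "USA", "CAN", "ARE"}
timss_2015 = {"DEU", "JPN", "KOR", "USA", "EGY", "ARE", "ZAF"}

common = pisa_2015 & timss_2015
print("overall overlap:", len(common))
print("overlap by region:", Counter(region[c] for c in common))
print("PISA only:", sorted(pisa_2015 - timss_2015))

Restricting comparative or trend analyses to such a common set of systems is the minimal safeguard against confusing differences in coverage with differences in performance.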

Sampling

With regard to sampling, there are three groups of ILSA studies:

1. With elementary school students (PIRLS, TIMSS fourth grade, LLECE third, fourth, and sixth grade, PASEC sixth grade, SACMEQ sixth grade)
2. With secondary school students (TIMSS eighth grade and TIMSS-Advanced, PISA 15-year-olds)
3. With adults (PIAAC, TALIS, TEDS-M)

The choice of target groups may reflect countries' focus in educational policy making as well as practical conditions. For example, in developing countries, compulsory schooling often ends after grade 6 or 8, so that larger parts of any cohort would not be covered by the PISA target definition (15-year-old boys and girls who regularly attend school). When higher sectors of the education system are studied, as, for example, in TIMSS-Advanced (mathematics and science in upper secondary education), the amount of cohort coverage must be taken into account, for example, when comparing findings to data from studies in lower secondary education.

Most studies in educational assessment use multistage sampling, sampling institutions (schools) first and students within institutions next. When comparing studies, exclusion criteria need to be checked carefully. Studies may differ, for example, on whether they include all kinds of private schools and special needs education. Sampling based on families, households, or workplaces is much more difficult and expensive than sampling in schools – yet it may lead to a more inclusive, truly representative description of how children and adolescents are growing up. Such differences will matter when comparing school-based studies to other kinds of social science studies – for example, with regard to measuring student well-being, a topic recently introduced in PISA as well as, for example, in the household-based World Vision survey (Andresen et al., 2017).

TIMSS is designed to represent the population of students attending mathematics classrooms after 8 years of regular schooling. Lower secondary schools are sampled randomly, and entire grade 8 mathematics classes are sampled within schools. Student age mostly ranges between 13 and 15 years. PISA is designed to represent the population of 15-year-old boys and girls attending school. Lower secondary schools are sampled randomly, and individual students are sampled within schools. Most PISA students attend 8th, 9th, or 10th grade. As a consequence, (a) PISA students are on average older than TIMSS students, and (b) the difference between mean age in TIMSS and mean age in PISA varies depending on a country's rules for school entry and grade retention.

The difference in sampling strategy may lead to unexpected differences in findings; a simulation sketch is given after this paragraph. For example, the achievement gap between students repeating a grade and non-repeaters reported in PISA (where repeaters will be assessed in lower grades than their peers) is twice as large as the same gap reported in LLECE's grade-based regional assessment for Latin America (Cardoso, 2020). In Germany (as in several other countries), a considerable number of students repeat grades, and the majority of them are boys. As a consequence, the gender gap in reading achievement estimated for Germany, favoring girls, is larger in PISA (with girls predominantly from grade 10 and quite a few same-age boys from lower grade levels) than it would be in TIMSS. Obviously, such differences may lead to "policy evidence by design" (Cardoso, 2020), for example, with regard to grade retention and gender-related policies.
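The consequences of class-based versus age-based sampling can be illustrated with a toy simulation, shown below in plain Python. School sizes, grade structure, and the age distribution are invented, and real ILSA sampling additionally involves stratification, probability-proportional-to-size selection, and weighting; the sketch only contrasts a TIMSS-like intact grade 8 class sample with a PISA-like sample of 15-year-olds.

import random

random.seed(1)

def make_school(school_id, n_per_grade=30):
    # Toy school with grades 8-10; some students are one or two years older for their grade.
    students = []
    for grade in (8, 9, 10):
        for _ in range(n_per_grade):
            age = 13 + (grade - 8) + random.choice([0, 0, 1, 2])
            students.append({"school": school_id, "grade": grade, "age": age})
    return students

population = [make_school(s) for s in range(200)]
sampled_schools = random.sample(population, 50)

# TIMSS-like design: within each sampled school, take the intact grade 8 class.
timss_sample = [st for school in sampled_schools for st in school if st["grade"] == 8]

# PISA-like design: within each sampled school, draw 15-year-olds across grades.
pisa_sample = []
for school in sampled_schools:
    eligible = [st for st in school if st["age"] == 15]
    pisa_sample.extend(random.sample(eligible, min(20, len(eligible))))

mean_age_timss = sum(st["age"] for st in timss_sample) / len(timss_sample)
grades_pisa = {g: sum(1 for st in pisa_sample if st["grade"] == g) for g in (8, 9, 10)}
print(f"TIMSS-like sample: grade 8 only, mean age {mean_age_timss:.1f}")
print(f"PISA-like sample: 15-year-olds spread across grades {grades_pisa}")

In such a setup the TIMSS-like sample is younger on average and confined to a single grade, while the PISA-like sample spreads across grades, which mirrors the two consequences noted above.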


The sampling schemes also provide distinct opportunities for research. TIMSS measures can be interpreted well at the classroom level, due to the full coverage of the classrooms sampled, and teacher information can be linked directly to the students in that teacher's class. The resulting data set is best suited to curriculum research and to research on opportunity to learn and teaching effectiveness. The PISA data, on the other hand, are more representative of the school as a whole and allow for studying variation with respect to students' years of schooling.

Sampling schemes also differ for teachers. TIMSS, TALIS, and PISA all include teacher questionnaires administered in secondary schools. The TALIS and PISA teacher questionnaires partly use common measures, for example, for teacher background and education. Yet, results may differ, because TALIS samples from all teachers, TIMSS samples the mathematics and science teachers who teach the sampled classes, and PISA covers teachers who are eligible to teach 15-year-old students.

Within programs, improvements in sampling may have the negative consequence of hindering comparability across cycles. For example, the 2015 TIMSS international report did not use previous data for trend analysis in several cases, due to increased population coverage (Mullis et al., 2016, appendix A.1). Thus, each and every study comparing ILSA test results, either across programs or across waves within a program, needs to double-check the details of sampling.

Domain of Assessment and Further Skills Covered

A further set of differences results from the programs' goals and domains of assessment. While PASEC, PIRLS, SACMEQ, and TIMSS always target the same cognitive outcome domains (reading in PIRLS, French and mathematics in PASEC and SACMEQ, and mathematics and science in TIMSS), PISA assesses a different major domain in each cycle, including two other domains as "minor" outcomes. PISA 2000, 2009, and 2018 had a focus on reading, PISA 2003 and 2012 on mathematics, while the major domain in PISA 2006 and 2015 was science.

The TIMSS Assessment of Grade 8 Mathematics Achievement and the PISA Mathematics Literacy Assessment both cover a broad array of student knowledge and understanding in lower secondary school mathematics (e.g., Mullis & Martin, 2013; OECD, 2016a), but there are a number of differences in study conceptualization and test design (Kuger & Klieme, 2016). As Wu (2010) and Hutchison and Schagen (2007) point out, the TIMSS and PISA frameworks differ greatly in their definitions of mathematics and science performance. These differences are in line with the programs' goals: a yield study (PISA) versus a study of teaching and learning in class (TIMSS; Schmidt et al., 2015). The TIMSS test is based on a comprehensive analysis of mathematics curricula worldwide, and it is meant to be curricularly valid across countries, that is, to cover mathematical ideas and tasks that students have seen in classrooms. The PISA test is based on a more general concept of "life skills" that students are supposed to need in order to be ready for further learning, to start a successful vocational or professional career, and to become informed citizens.


Context, Input, Process, and Output-Related Constructs

The majority of ILSAs include a set of common material in their context assessment, independent of their study goals, design, or sampling strategy. For example, every study includes some basic measures of student background – gender, date of birth, country of birth, languages used at school and/or at home – and some measure of SES. Those measures are needed to check the representativeness of the sample, to report on equity between groups of students, and to control for "exogenous" variables when studying the effect of educational factors on student achievement. Starting with FIMS, the first representative international study, teachers and/or school principals have been asked to provide additional "background" information on the school, such as school size, staff qualifications, and resources. All those "Input" variables may be used descriptively to characterize the conditions of schooling across countries, they may be combined to indicate (un)equal access to learning opportunities, and they may serve as control variables in statistical analyses. Similarly, "Context" variables provide information on economic, social, and cultural conditions – for example, wealth, demography, shared norms and values – which may shape the education system.

FIMS also introduced a "student opinion booklet" to measure "Output," that is, affective outcomes of education. Subsequently, ILSAs have used a large number of "non-cognitive outcomes," as they are labeled now – most prominently, students' interest in the subject tested and/or self-efficacy regarding that subject, and cross-curricular outcomes such as general learning motivation, self-concept, educational aspirations, and well-being at school and in life. All those constructs are relevant from a human resources perspective on education.

Finally, in addition to Context, Input, and Output, ILSAs have over the years referred to a growing number of constructs related to "Processes" in education – on the individual level (e.g., learning time, course attendance, truancy, and learning strategies), the classroom level (opportunity to learn specific content, teaching practices, classroom climate), the organizational level (school climate, collaboration, leadership, school policies), and the system level (school autonomy, evaluation, and accountability). Coverage of such constructs links ILSAs to the knowledge base in Educational Effectiveness Research (Klieme, 2013).

Thus, ILSA studies typically gather information on context and input factors, processes, and education outcomes at the system, school, and individual levels. The so-called CIPO taxonomy was developed at IEA to categorize constructs accordingly (Purves, 1987). More recently, ILSAs are usually based on a framework document that lists, classifies, and justifies the selection of constructs implemented. For example, the PISA 2012, 2015, and 2018 questionnaire designs were based on an enhanced CIPO model composed of both domain-specific and domain-general "modules" (Klieme & Kuger, 2016). For an attempt to integrate frameworks from different ILSA programs into a coherent conceptual structure, see Jensen and Cooper's (2015) work on PISA and TALIS. Even between TIMSS and PISA, there is a growing overlap in the constructs covered – for example, PISA has integrated opportunity to learn, which had been covered for many years by TIMSS, while TIMSS now provides more sophisticated information on student SES and school climate than it used to.

Frameworks and overlaps in construct coverage help develop a common language and methodology for designing and analyzing ILSAs, especially as they translate easily into multilevel prediction models – typically with outcomes as dependent variables, individual/classroom processes as predictors and mediators, school/system-level processes as moderators, and input/context as control variables. However, the convergence in construct coverage, frameworks, and modeling approaches may obscure more subtle differences: even if the construct names and categories look similar, the actual measures may be different, as discussed in the next section.
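To make the translation into a multilevel model concrete, a generic two-level specification is sketched below in LaTeX notation; the variable names are placeholders that mark the CIPO roles, not constructs from any particular ILSA framework.

% Student i nested in school j; y is a cognitive or noncognitive outcome.
\begin{aligned}
y_{ij} &= \beta_{0j} + \beta_{1}\,\mathrm{Process}^{\mathrm{ind}}_{ij}
          + \beta_{2}\,\mathrm{Input}_{ij} + \varepsilon_{ij} \\
\beta_{0j} &= \gamma_{00} + \gamma_{01}\,\mathrm{Process}^{\mathrm{school}}_{j}
          + \gamma_{02}\,\mathrm{Context}_{j} + u_{0j}
\end{aligned}

Cross-level interaction terms would let school- or system-level processes moderate the individual-level coefficients, while Input and Context serve as controls, mirroring the verbal description above.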

Comparing, Linking, and Matching Operational Measures

Differences in Instruments and Measures

Unfortunately, even with similar or identical construct names and conceptual rubrics that seem to be well aligned, the commonality across ILSAs is seldom an exact match, leaving each study with its own realization – that is, test and questionnaire design, tasks and questions, item wording – and thereby limiting comparability across studies. This caveat applies to achievement tests, measures of SES, and questionnaire scales alike. The introduction of computer-based assessment adds another source of variation in measurement.
Achievement tests: Even if domains are similar, they may be conceptualized differently, depending in part on the goals and values of the organization sponsoring the assessment: IEA is concerned with schooling and curriculum from a professional education perspective, while UNESCO is addressing education from a human rights perspective, and OECD is interested in skill production from a human capital perspective (Cardoso, 2020). As discussed in the previous section, disciplinary knowledge and skills are highlighted in IEA's TIMSS, while problem solving and general mathematical competencies play a stronger role in OECD's PISA (Wu, 2010; Hutchison & Schagen, 2007). As a consequence, PISA test items tend to be embedded in real-world contexts more often and to provide lengthier text than typical TIMSS items (Neidorf et al., 2006). TIMSS administers more mathematics items than PISA, especially more short, multiple-choice items; it more often addresses knowledge of facts and procedures, and it has a larger proportion of items on Numbers and Algebra as compared to Data and Uncertainty. Interestingly, experts comparing TIMSS and PISA mathematics items came to different conclusions depending on the criteria for good measurement. Klieme et al. (2001), based on ratings by trained experts, praised PISA items for requiring higher levels of conceptual understanding. Similarly, Neidorf et al. (2006) highlight the "mathematical complexity" of PISA items, that is, higher demands on abstract reasoning, analysis, and judgment. At the same time, Neidorf et al. (2006) state that, although administered in grade 9 or 10, the vast majority of PISA items refer to content taught at grade 8, while TIMSS was more aligned with the curriculum.


Smithson (2009) confirmed that PISA and the high school curriculum were "quite different at fine-grain topics level and the levels of expectations." Another US group of subject matter experts judged PISA as being "quite weak in mathematical content," stating "this is a problem-solving test and, although mathematics is used, that seems almost incidental" (Carmichael et al., 2009, p. 2). The same authors judged TIMSS as almost perfect in terms of content, rigor, and clarity. It should be noted, however, that assessment frameworks for both programs have changed since (which, by the way, threatens trend analysis, i.e., comparison between cycles) and may be more similar now. Yet, analyzing mathematics items from PISA 2012 and TIMSS 2011, Hole et al. (2018) found that TIMSS included a larger proportion of items referring to mathematical formulae or theorems. Differences in testing frameworks do not only reflect the life skills (OECD) vs. curriculum (IEA) divide; they also show up when comparing, for example, PIAAC and PISA (Gal & Tout, 2014).
Measures of socioeconomic status (SES) and SES-related constructs: The very first ILSA, the FIMS study, used parental occupation as the sole indicator of students' family background (Husén, 1967). In general, SES concerns the "position of individuals, families, households [. . .] on more dimensions of stratification. [. . .] These dimensions include income, education, prestige, wealth, or other aspects of standing that members of society deem salient" (Bollen et al., 2001, p. 157). Thus, a wider variety of indicators may be used. In terms of Bourdieu's sociological theory of social, cultural, and economic capital, "number of books at home," probably the most popular measure of SES used in current ILSAs, especially in IEA studies, can be interpreted as an indicator of "objectified cultural capital," while parental education, also a popular measure, refers to "institutionalized cultural capital" (Sieben & Lechner, 2019). OECD attempts to cover all facets of the concept in its "index of economic, social, and cultural status (ESCS)" (OECD, 2016), which combines measures of parental occupation, parental education, "number of books at home," and other home possessions, applying IRT scaling for home possessions and principal components analysis across the three major facets. Thus, PISA assumes unidimensionality of SES indicators, which is questionable (Watermann et al., 2016; Ye et al., 2021). Obviously, these measures are not interchangeable. Even simple measures labeled identically may be incommensurable. Number of books at home, for example, was asked of TIMSS grade 8 students in exactly the same way from 1995 up to 2019, labeling, for example, the second-to-lowest category as "enough to fill one shelf (11–25 books)" and warning "Do not count magazines, newspapers, or your school books." In grade 4, students were provided with little graphs illustrating the respective number of books, while their parents were asked the same question with just numbers as options. PISA 2000 also asked students about the number of books in their home, yet the options were different (just numbers: 1–10, 11–50, . . .), informing students that there are usually 40 books per meter of shelving and warning them to exclude magazines only. Later, PISA adopted the same grouping of options as TIMSS (0–10, 11–25, 26–100, . . .).
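To illustrate why even this seemingly simple variable resists one-to-one harmonization, the sketch below recodes the response options cited above onto a deliberately coarse common scheme. The target coding, the mapping, and all names are hypothetical; only the category labels are taken from the wording quoted above, and higher categories (elided above) would be handled analogously.

```python
# Coarse harmonization of "number of books at home" response options.
# Harmonized codes: 0 = "10 or fewer books", 1 = "11-100 books".
# Only the lowest response options quoted in the text are covered here.

TIMSS_TO_COMMON = {        # TIMSS grade 8 options start 0-10, 11-25, 26-100, ...
    "0-10": 0,
    "11-25": 1,
    "26-100": 1,
}

PISA2000_TO_COMMON = {     # PISA 2000 options start 1-10, 11-50, ...
    "1-10": 0,
    "11-50": 1,            # straddles the TIMSS 11-25 / 26-100 boundary, so any
}                          # harmonized variable must give up that finer distinction

def harmonize(label: str, mapping: dict[str, int]) -> int:
    """Map a raw response option onto the harmonized coarse code."""
    return mapping[label]

print(harmonize("11-25", TIMSS_TO_COMMON))     # 1
print(harmonize("11-50", PISA2000_TO_COMMON))  # 1 as well, despite different bounds
```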


Similarly, TIMSS and PISA use different lists of home possessions to assess family wealth, and different ways of assessing parental education. These examples suggest that the same measure may have different meanings across cycles and groups of respondents within one ILSA program, as well as between programs. Moreover, students and parents agree only weakly on SES-related information (Watermann et al., 2016). ESCS measures are not equivalent across PISA cycles because of changes in question stems and options (as shown above), a new "International Standard Classification of Occupations (ISCO)" being applied from 2012 on, parental education being converted into "years of schooling" in different ways, and changes in OECD membership affecting OECD-standardized scoring. Therefore, OECD is using a separate "trend ESCS measure" for the analysis of trends in educational equity (OECD, 2016, p. 310). More revisions are under way (Avvisati, 2020). Items and scaling methods have changed across waves in TIMSS as well, so that trend analysis requires careful construction of a harmonized measure (Broer et al., 2019). Thus, researchers wishing to compare SES-related measures or derivatives such as indices of equity and inequality (Strello et al., 2021) or academic resilience (Ye et al., 2021) across ILSA instruments, waves, and/or programs need to check the instruments very carefully.
Questionnaire scales: Caution is also required when comparing other kinds of questionnaire scales and indices. Again, this includes changes in the "same" measure across cycles within one program. For example, TIMSS asked students about mathematics teaching practices in their classroom four times between 1995 and 2007. Although about 15 items were used for that question in each cycle, only four were kept unchanged (e.g., "We work together in pairs or small groups"), and the response scale was revised in 2003 (Rozman & Klieme, 2017). From 2011 on, TIMSS changed from asking about specific practices to more general qualities of teaching, such as instructional clarity and "disorderly behavior in mathematics lessons" – adapting the "generic" approach that had been introduced by PISA (see Klieme & Nilsen, this volume). Thus, while making comparison of mathematics teaching between cycles (i.e., trend analysis) impossible, TIMSS now allows for better alignment with PISA in that area. Yet, even where constructs overlap, there are differences in construct labels (TIMSS: "disorderly behavior," PISA: "disciplinary climate"), in item wording (among the six and five items, respectively, measuring that construct, only two are phrased identically, while in others there are subtle differences, like TIMSS: "There is disruptive noise" vs. PISA: "There is noise and disorder"), and in response options (both TIMSS and PISA apply a four-point frequency scale, yet labels are different, for example, TIMSS: "Every or almost every lesson – About half the lessons – . . ." vs. PISA: "Every lesson – Most lessons – . . ."). As a consequence, both measures of classroom order would surely qualify as indicating the same construct, for example, in a meta-analysis, but the measures cannot be directly compared. As Wu (2010) notes, PISA and TIMSS also share some attitudinal constructs, while measuring them with different sets of items and using different labels (e.g., TIMSS: "Self confidence in learning mathematics," PISA: "self-concept in mathematics").


Comparing TIMSS 2015 and PISA 2015, a similar situation has been observed for measures of instrumental motivation and enjoyment of science by He et al. (2019). These authors also discuss one pair of variables that even have identical names in TIMSS and PISA: "Sense of belonging." Although there is one nearly identical item ("I feel like I belong at (TIMSS: this) school"), the measures address quite different aspects of students' experience: feeling good and safe at the school (TIMSS) vs. being well connected to peers (PISA). As a consequence, He et al. (2019) report differences between TIMSS and PISA regarding measurement quality and the pattern of correlations with student achievement. IEA studies are famous for their coverage of "opportunity to learn (OTL)." As these studies are curriculum-based, OTL indices can be based on teacher judgments about the specific topics covered in their classroom teaching; in addition, curriculum documents and textbooks have been analyzed in detail (Travers & Westbury, 1989; Schmidt & Maier, 2009). When OTL in mathematics was taken up in PISA 2012, a different approach was taken: asking students about the types of problems they had experienced in class (Klieme et al., 2013). Using data from both TIMSS and PISA 2012, Luyten (2017) found that the PISA approach yielded much stronger relationships with both student achievement and socioeconomic status on the school level than the TIMSS approach, which may in part be due to the different sampling schemes. Luyten (2017) also reports that the two OTL measures were uncorrelated, but equally predictive of student achievement on the country level. Using a different approach again, PISA 2018 asked students about OTL for reading. In general, as the "major domain of assessment" rotates between reading, mathematics, and science in PISA, the reference for context measures rotates accordingly. Thus, although measures of "disciplinary climate" and "teacher support," for example, have been kept unchanged across PISA assessment waves, variation between waves should be interpreted as a mixture of subject-matter effects and cohort effects.
Mode of assessment and scaling: In 2015, PISA was for the first time administered on computer in all but a few countries, while TIMSS introduced computer-based assessment in 2019. There are further technical differences, such as details of the item response theory approach used (e.g., TIMSS includes a "guessing" parameter for multiple-choice items which PISA does not, and PISA 2015 introduced a more comprehensive approach for linking scales to previous waves of the assessment). Differences in mode and scaling may again limit comparability of measures between programs (see the final section of this chapter) as well as across cycles within one program (Robitzsch et al., 2020). For the PISA reading assessment, Kröhne et al. (2019) confirmed construct equivalence between computer-based and paper-based administration, while observing some differences in item difficulty and in the amount of missing responses.
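The remark about the "guessing" parameter can be made explicit. In a three-parameter logistic (3PL) model, as used for TIMSS multiple-choice items, the probability of a correct response is

$$P(X_{pi}=1 \mid \theta_p) = c_i + (1-c_i)\,\frac{1}{1+\exp\big(-a_i(\theta_p-b_i)\big)},$$

where $\theta_p$ is the proficiency of person $p$ and $a_i$, $b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing parameters of item $i$. The PISA scaling models contain no such guessing parameter, which in this simplified notation corresponds to fixing $c_i = 0$ (with discrimination parameters additionally constrained to be equal in the Rasch-type models PISA used through 2012).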

Linking Measures from Different Studies

Even if measures differ between ILSA programs or individual waves of assessment, findings may be compared using advanced statistical techniques.


If there is some overlap in items, that is, a set of "anchor items," data from two distinct measures may be mapped reciprocally through linking algorithms. Within multi-wave assessment programs, linking adjacent waves of student tests through common items is a regular practice (Mazzeo & von Davier, 2014). Both for TIMSS and PISA, linking errors are estimated and reported in technical documents (e.g., OECD, 2016; Stanco et al., 2010). In international studies, linking errors comprise assessment-by-item interaction as well as country-by-assessment-by-item interaction (Robitzsch & Lüdtke, 2019). Von Davier et al. (2019) applied linking across multiple waves of PISA. Johansson and Strietholt (2019) established links between five waves of TIMSS (1994–2011). Strietholt and Rosén (2016) applied linking across several IEA studies of reading at the end of primary school covering over 40 years, while Majoros, Rosén, Johansson, and Gustafsson (2021) linked IEA studies covering more than 50 years of mathematics assessment. Linking techniques are also used for "trend" questionnaire scales, that is, scales used in multiple cycles (OECD, 2016). If two instruments (tests or questionnaire scales) refer to the same construct, but there is no overlap in items, a linking study administering both tests to the same group of students may make it possible to establish a common scale. This approach (joint calibration) has sometimes been used to combine international and national assessments. For example, a study commissioned by the NCES (2013) linked TIMSS Grade 8 Mathematics to the National Assessment of Educational Progress (NAEP) in the USA. When estimated results based on linking from NAEP were compared to actual TIMSS results, which were available for nine "validation states," mean achievement was slightly underestimated, probably due to differences in population coverage. Applying adjustments for sampling and exclusion, fairly close estimates were attained. A much simpler linking method ("moderation," i.e., applying a transformation function) yielded similar results, so this method was ultimately used to produce TIMSS estimates for all 50 US states. For further approaches to linking TIMSS with NAEP, see, for example, Lim and Sireci (2017). In Germany, PISA and TIMSS (grade 4) mathematics and science tests have been successfully linked to measures from the German National Educational Panel Study (NEPS) through the equipercentile method (Ehmke et al., 2020). Other research has cautioned against aligning test scales (for the US: Loveless, 2008). When PISA (reading and mathematics) and PIRLS tests were compared to national standards-based assessments in Germany (for an overview, see Jude et al., 2013), authors found that modeling international and national tests as two separate, although correlated, dimensions fitted the data better than unidimensional models. This does not rule out equipercentile linking – yet differences in meaning need to be taken into account. Finally, Hastedt and Desa (2015) studied the use of released TIMSS mathematics items for mapping national assessments to the international TIMSS scale. This was a simulation study based on data from three developing countries. Even with 30 anchor items, the simulated linking error was still very high, so that results from a national assessment would not be in line with findings from TIMSS proper. As a conclusion, the authors advised against using released items to link national assessments to ILSAs.


Given that linking methods are so common in ILSA, and there are so many attempts at linking ILSAs with national assessments, it is striking to note that there does not exist any published empirical study assessing both TIMSS (Grade 8) and PISA test items in order to establish a common scale. Some attempt was made in a national extension to PISA 2000 in Germany, where 15 mathematics items from TIMSS (the overlap between the lower and upper secondary mathematics assessments in TIMSS 1995) were administered to 2174 students in addition to the PISA items. The national PISA report only briefly mentioned that study (Klieme et al., 2001). The authors were able to calibrate all items on a joint, unidimensional scale with acceptable fit. Compared to the TIMSS items, PISA covered a wider range of difficulties, especially at the upper (more difficult) end. In a two-dimensional scaling model, the latent correlation between TIMSS and PISA was 0.91. Future research would greatly benefit from a full, multi-country linking study based on current measures from TIMSS and PISA. Some researchers have tried to link TIMSS with PISA without joint calibration from any specific linking study. Hanushek and Woessmann (2015) have gained widespread attention among researchers and policy makers for constructing a comprehensive international database built from various waves of both TIMSS and PISA, using the US National Assessment of Educational Progress (NAEP) as a tertium comparationis. However, they assume that differences in sampling (e.g., grade-based vs. age-based, or changes in population coverage) are negligible, they generalize links based on US data across countries (assuming there is no country DIF in the linking), and they apply a relatively weak linking method. Still weaker and more presuppositional approaches for establishing common scales will be discussed in the next section.
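For readers unfamiliar with the linking methods named above, the following is a bare-bones sketch of the core of an equipercentile transformation. All names and numbers are illustrative, and operational linking additionally involves presmoothing, sampling weights, and standard errors.

```python
import numpy as np

def equipercentile_link(scores_x: np.ndarray, scores_y: np.ndarray,
                        new_x: np.ndarray) -> np.ndarray:
    """Map scores from scale X onto scale Y so that percentile ranks match."""
    x_sorted = np.sort(scores_x)
    y_sorted = np.sort(scores_y)
    # percentile rank of each new X score within the X calibration sample
    ranks = np.searchsorted(x_sorted, new_x, side="right") / len(x_sorted)
    # Y score holding the same percentile rank in the Y calibration sample
    return np.quantile(y_sorted, np.clip(ranks, 0.0, 1.0))

# toy usage with simulated score distributions for two assessments
rng = np.random.default_rng(0)
x = rng.normal(500, 100, 5000)   # scores on assessment A
y = rng.normal(250, 50, 5000)    # scores on assessment B
print(equipercentile_link(x, y, np.array([400.0, 500.0, 600.0])))
```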

Searching for Universal Descriptors and Scales

Researchers have long dreamed of aligning test scales on conceptual grounds only, circumventing the burdensome implementation of empirical linking studies. The goal is to come up with a universal, synthetic proficiency scale that can be matched onto all kinds of assessments, say, in reading or mathematics, whether national or international. There seems to be a prominent example of such a "magic bullet," namely the Common European Framework of Reference for Languages (CEFR), issued by the Council of Europe (2020), with a history of about 30 years. This framework establishes a hierarchy of six levels (from A1 to C2) to be reached in learning foreign languages. Requirements for language competence, as well as certification thereof, are often described in terms of those levels. Nevertheless, the CEFR is just a framework, not an assessment. It is based on "Can do" statements that have been collected in adult learning contexts and judged by experts. The levels are inferred from those judgments. Tests of foreign language competencies – across multiple languages – often claim to assess the test taker's competence according to the CEFR hierarchy. (For an overview covering no less than six tests each for English and German, plus single tests for five other languages, see https://www.fremdsprachenzentrum-bremen.de/1072.0.html.)


However, as there is no international prototype meter in this field, such claims can only be supported by complex validity arguments, including standard setting using CEFR "Can do" statements as proficiency level descriptors (North et al., 2009). To support the analysis of reading and listening item demands in respect of the CEFR, an international team of experts provided an online grid (Alderson et al., 2006). This grid might be used to align ILSA measures of reading competence with the CEFR, but to the knowledge of the present author it has not been used for that purpose yet, probably because the CEFR is tailored toward learning foreign languages. Instead, UNESCO recently started a "Learning Metrics Partnership," establishing a "Study Group on Measuring Learning Outcomes" at its Center for Global Development (UNESCO, 2019; Anderson, 2019). This was supposed to help with implementing the United Nations Sustainable Development Goals, especially with defining indicators for SDG 4, "Ensure inclusive and quality education for all and promote lifelong learning opportunities for all." Other brands used by UNESCO and its allies include the Global Alliance to Monitor Learning (GAML) and the Assessment for Learning Initiative (A4L). Part of A4L is the Analysis of National Learning Assessment Systems (ANLAS) project. Operationally, the Australian Council for Educational Research (ACER), which was instrumental in launching PISA and running it from 2000 till 2012, is in charge of implementing ANLAS. The goal is "to provide a resource for developing country partners to build effective and sustainable learning assessment systems for evidence-based decision making in education policy and practice and to support education sector planning. (. . .) The ANLAS manual and a set of templates are provided to support the implementation process and to guide the analysis. Countries can adapt the tools to best fit the national context" (GPE & ACER, 2019). As part of GAML, the UNESCO Institute for Statistics, in collaboration with ACER, developed so-called common reporting scales. These numerical scales are associated with substantive descriptions that explain levels of proficiency in the learning domains identified in the SDG indicator framework, that is, reading and mathematics. Test questions from a variety of assessment programs were ordered by difficulty using statistical methods and pairwise expert judgment, and experts subsequently judged the skills required to answer each of the questions correctly. These skill descriptions form the basis of the reporting scales (Adams et al., 2018). For example, 533 items were used to establish the mathematics scale. About 5% of those were taken each from TIMSS and PISA, which serve as the "cornerstones" of the new universal mechanism (Turner et al., 2018, p. 42 and p. 48). Another approach to finding common descriptors for large-scale tests in mathematics and language arts has been started through an International Research Network initiated by the Center for Research on Evaluation, Standards, and Student Testing (CRESST) in Los Angeles. Based on feature analysis (Baker et al., 2014), participating research teams will provide qualitative ratings of content distribution, cognitive demands, and task elements, and, using whatever data are available (national or international), the impact of these features on item difficulty, discrimination, and other item parameters will be checked, while the group does not aim at establishing any universal scale.


Even more challenging is the establishment of a common universal language in less well-defined areas of measurement, such as so-called "21st century skills," where cultural and educational context matters even more (Jaberian et al., 2018).

Combining Data Across Assessments in Educational Research

As discussed in the previous section, it is difficult to directly compare figures from different ILSA waves or programs, because that would require linking measures, mapping them onto the same scale, and ensuring equivalence of sampling schemes. Nevertheless, researchers have come up with a variety of creative research designs which make use of combined data without neglecting the differences in measurement and sampling. One such design, the analysis of trend data and the comparison between different studies on the country level, will be discussed in the next and final section using the case of TIMSS and PISA. The present section presents a variety of research studies combining data from two or multiple assessments, mostly on the student or school level. Kuger and Klieme (2016) reported a growing number of studies explicitly applying and modeling the differences between studies in order to answer certain research questions. Strietholt and Scherer (2018) provide a review of studies combining individual and institutional data sources in ILSAs. Lindblad and Pettersson (2019), evaluating the "intellectual and social organization of ILSA research," report that among the 518 research papers based on TIMSS or PISA data they found, no less than 56 used data from both programs. In order of complexity, three different approaches to combining across studies can be distinguished: (i) integrating findings of analyses run on two or more data sets in parallel, (ii) different ways of matching data sets, and (iii) enhancing sampling schemes to build bridges between separate studies, allowing for longitudinal analysis on the individual or institutional level.

Integrating Diverging and Converging Findings Across Studies

Andrews et al. (2014) compared the high success of Finland in PISA with its modest results in the TIMSS grade 8 mathematics assessment, calling this contrast "an Enigma in Search of an Explanation." Based on in-depth studies, such as interviews with teachers, they concluded that the PISA success was most probably due to more general factors in Finnish society and the education system, rather than to mathematics didactics and classroom practices, which were in fact more modest in comparison to what was known from other countries. Years later, measures of teaching quality in PISA 2012 (Schiepe-Tiska et al., 2013) as well as video studies conducted in the Nordic countries (Klette et al., 2017) confirmed that conclusion. More examples of divergent findings from TIMSS and PISA have been reported earlier in the present chapter. They referred to grade repetition effects and gender differences (Cardoso, 2020, and the case of Germany), OTL in mathematics (Luyten, 2017; Schmidt et al., 2018), and "sense of belonging" in school (He et al., 2019).


Rather than criticizing such divergence as an indication of flaws, errors, or arbitrariness ("policy evidence by design": Cardoso, 2020), researchers increasingly try to interpret it as substantive findings. He et al. (2019), for example, found good measurement quality and coherent findings for TIMSS and PISA measures of instrumental motivation and enjoyment of science; the "problem" in findings on sense of belonging indicated important multi-dimensionality and cross-cultural variation in that specific construct rather than a general inadequacy of noncognitive measures in ILSAs. Recently, the construct of academic resilience, that is, students' capacity for high performance despite a disadvantaged background, has become quite popular in educational studies. ILSAs provide a multitude of related, and sometimes contradictory, findings. Again, differences in how resilience is operationalized help to understand that construct rather than undermining the research base. Ye et al. (2021) identify 20 ILSA studies applying measures of socioeconomic status and achievement, different approaches to setting thresholds, and, consequently, different classifications of individual students as resilient or nonresilient. Variations in classification lead to differences in how resilience relates to economic context, language background, and gender. As a consequence, the authors advise using country-specific thresholds to avoid classifications mainly depending on the countries' status in economic and social development. Borgonovi et al. (2018) as well as Solheim and Lundetræ (2018) discuss whether different sizes of the gender gap in reading as measured in PIRLS, PISA, and PIAAC can be interpreted as indicating developmental patterns across the life span. While the papers differ in their interpretation of ILSA findings, they agree that certain assessment features such as text types, item formats, aspects of reading covered, and administration details may contribute to the divergence in gender gap measures. Brunner et al. (2018) mention that their estimation of between-school variation in student achievement (intraclass correlations) based on PISA is larger than estimations reported by Zopluoglu (2012) based on TIMSS data. They attribute this divergence to differences in sampling schemes and the nature of the achievement measures. From a methodological perspective, the interesting point is that such divergence may lead to different recommendations for sample sizes when planning group-randomized trials in education. Once again, the difference in study design and findings is used to inform further research, rather than to criticize existing findings. It should be stressed, however, that quite often findings are coherent across ILSA studies. Apart from country-level indicators of mean student achievement and trends in achievement, as discussed in the next section, there is much consistency in how achievement relates to so-called noncognitive measures. The size of those relationships is a fundamental research question in educational research and psychology, and it can be answered based on ILSAs. Lee and Stankov (2018) integrated findings on 65 noncognitive variables from various waves of TIMSS and PISA in a meta-analytic kind of framework. They found self-efficacy beliefs, confidence, and educational aspirations to be most predictive of student achievement. Yet, the direction of impact cannot be studied in ILSAs unless – as discussed below – studies are combined to set up longitudinal designs.


Matching Data

Many educational systems, most prominently England and the USA, provide broad coverage of student achievement data for school accountability and educational system monitoring purposes, and these data are available for secondary analyses, allowing all kinds of sophisticated research questions to be answered. In contrast, data from ILSAs are sample-based only. Yet, in small countries, ILSAs are implemented as a census, or close to a census, in order to meet international criteria for sample sizes. In principle, this allows for direct integration of data across assessments – whether from different ILSA programs or waves within one program – at least on the institutional (school) level, if not on the individual level. For example, data from TIMSS and PISA, or PISA and TALIS, could be matched if school, teacher, and/or student IDs were shared across studies in those countries. Nonetheless, the present author does not know of any such study apart from an explorative analysis of data across two waves of PISA – more precisely, a national enhancement of PISA 2000–2003 assessing all 16 German Laender – from a few small German states (Klieme & Steinert, 2009). Kaplan and Turner (2012) provide an approximation to full data matching through advanced statistical methods sometimes called "data fusion," and they evaluated the feasibility of this approach using real census-like data from TALIS 2008 and PISA 2009 in Iceland. The paper presents a systematic evaluation of a set of statistical matching methods focused on the goal of creating a synthetic file of PISA 2009 and TALIS 2008 data for Iceland. In the summary of their paper, the authors conclude: "The experimental study provides a proof of concept that statistically matching PISA and TALIS is feasible for countries that wish to draw on the added value of both surveys for research and policy analysis." Nevertheless, to the knowledge of the present author, the approach has not been used since. Another creative way of combining data has been explored by Meroni et al. (2015). By identifying subsamples of teachers within the PIAAC samples, they were able to study the relationship between teacher skills – as measured in PIAAC – and student achievement – as measured in PISA – on the country level. Following a suggestion by Gal and Tout (2014), several researchers used a specific kind of "matching" to combine PISA with PIAAC data, assuming that the sample of students assessed at age 15 in PISA 2000 or 2003 is comparable to the subsample aged 23 to 27 assessed in PIAAC 2011/12. Multiple papers adopting this approach have been published since 2016. While Williams (2019) found that PISA 2000 scores accounted for 70 percent of the cross-national variation in PIAAC, several authors used the design to study the change of inequalities from adolescence (age 15) to adulthood (age 27). According to Borgonovi et al. (2017), socioeconomic disparities were exacerbated, while the gender gap in reading literacy vanished. Cathles, Ou, Sasso, Setrana, and Veen (2018) found some convergence of the skills gap between second-generation immigrants and natives over time, while for first-generation immigrants the gap in literacy skills compared to both natives and second-generation immigrants increased over time.


Dämmrich and Triventi (2018) enhanced the method by adding primary school data (TIMSS, PIRLS), establishing a "pseudo-cohort approach" to study inequalities in cognitive competencies from primary school to young adulthood among 15 OECD countries. Overall, social inequalities in competencies tended to persist (reading) or increase (mathematics) over the early life course, with some weak evidence that this is even more the case in highly tracked systems. Gustafsson (2016) elaborated the approach further by incorporating data from all PISA cycles and multiple corresponding age groups in PIAAC. Country-level trends (2000–2012) in student achievement were strongly related to differences between corresponding age groups in PIAAC – a correlation that Gustafsson interpreted as a "lasting effect of quality of schooling" on adult performance. IEA's TIMSS program provides even better opportunities for running cohort-based analyses, because it is administered every 4 years in grade 4 and grade 8, and sometimes additionally in grade 12. By systematically combining primary school data (TIMSS grade 4, PIRLS) with secondary school data (TIMSS grade 8, PISA) across many countries, Hanushek and Wößmann (2006) and, more recently, Strello, Strietholt, and Steinmann (2021) have studied the effects of early tracking on educational inequality. The latter group of authors used data from 75 countries covering 20 years, combining a total of 21 cycles of primary and secondary school assessments to estimate difference-in-differences models on the country level. To avoid confounding effects when determining changes between primary and secondary school, they applied two matching approaches: matching roughly the same years (e.g., PIRLS 2001 with PISA 2000) avoids cohort effects, while combinations from the same cohorts (e.g., TIMSS grade 4 in 2011 with TIMSS grade 8 in 2015) are subject to period effects. The authors found strong evidence that tracking increased social achievement gaps, but no evidence that tracking increased performance levels. The authors finally stressed the added value of combining data across studies because of the low reproducibility of findings based on single international datasets.
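In schematic terms, the country-level difference-in-differences logic described here contrasts the change in an outcome indicator from primary to secondary school between early-tracking and non-tracking systems; a generic sketch (notation chosen for illustration, not the authors' exact specification) is

$$\Delta y_c = y_c^{\,\text{sec}} - y_c^{\,\text{prim}}, \qquad \Delta y_c = \alpha + \delta\,\text{Tracking}_c + \mathbf{x}_c'\boldsymbol{\beta} + \varepsilon_c,$$

where $y_c$ is, for example, the socioeconomic achievement gap in country $c$, measured once with a primary school assessment and once with a secondary school assessment, and $\delta$ estimates the association of early tracking with the growth of that gap.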

Longitudinal Designs

Compared to synthetic samples or pseudo-cohorts, real longitudinal (panel) designs are of course more robust and provide stronger foundations for claims about the development of competencies and inequalities, and for causal analyses of the impact of student, family, class, teacher, and school factors on student outcomes. Therefore, research teams from several countries have added national follow-up assessments to individual ILSA studies. A review of such national enhancements is beyond the scope of the present chapter, but it is worth noting that so far only three international studies have implemented truly longitudinal designs across multiple countries in the context of ILSA: Burstein (1993), Panayiotou et al. (2014), and Opfer et al. (2020) (the TALIS Video Study, which included about 650 mathematics teachers from East Asia, Europe, and Latin America, all teaching quadratic equations).


Compared to designing and administering a follow-up to some specific study, combining two ILSAs to allow for longitudinal analyses may be considered a very elegant, efficient approach. So far, such an approach has been implemented combining PISA and PIAAC or TIMSS and PISA on the individual level, and combining two waves of PISA on the school level. As Maehler and Konradt (2020) point out, three countries administered PIAAC tests to young adults who had previously participated in PISA: Denmark, Singapore, and the USA. Singapore implemented this follow-up in a subsample of PISA 2009, the USA started with PISA 2012, and no findings have been reported yet. In Denmark, 1881 participants from PISA 2000 were retested and interviewed again in PIAAC 2011–2012. Sørensen et al. (2019) analyzed the data of those individuals who had already entered the labor market in 2012 to investigate the returns to cognitive and noncognitive skills – as measured at the age of 15 (PISA) and after entering the labor market (PIAAC) – in terms of labor market outcomes, namely earnings and employment rate. Cognitive skills turned out to be important for both outcomes, and noncognitive skills turned out to be important for earnings independent of the timing of their acquisition. Carnoy et al. (2016) report on a unique data set: one country (Russia) administered the PISA mathematics test in 2012 in ninth grade to all students who had taken the Trends in International Mathematics and Science Study (TIMSS) test in 2011 and collected information on students' teachers in ninth grade. These data allowed them to estimate the effects of classroom variables on students' PISA performance, controlling for students' prior knowledge. Similar to findings from a national follow-up to PISA 2012 in Germany (Kuger et al., 2017), the authors report that the effects of teaching quality and opportunity to learn (i.e., types of mathematics problems covered in school) on student performance were much more modest than found in cross-sectional analyses, with OTL for formal mathematics (i.e., school algebra) being the strongest predictor. Instead of assessing individual students longitudinally, data from two or more waves of any ILSA may be used to study change on the school level. Klieme and Steinert (2009) explored this approach in a feasibility study using non-representative German data from three waves of PISA, while Bischof et al. (2013) reported on a panel of 50 schools that took the PISA test both in 2000 and in 2009. The latter study found positive effects of establishing all-day schooling on school climate, while internal school evaluation had positive effects on both school climate and achievement. Implementing such an approach would be easiest in very small countries, where a larger group of schools, sometimes even all schools in the country, participate in consecutive waves, providing an extraordinary opportunity to study conditions of school improvement and school decline. Unfortunately, to the knowledge of the present author, such a study has not been implemented at full scale yet, although it could provide valuable insights for educational policies.

Analyzing ILSA Data on the Country Level

Some of the papers cited in the previous section, when integrating findings or matching data across ILSA studies, actually argue on the country level.


The more cycles (waves) of ILSAs researchers can use, the stronger such analyses and arguments may become, including analyses of country-level trends. In the following, we use the example of TIMSS and PISA to discuss the opportunities and limitations of measuring country-level trends based on ILSA data. Both programs were administered in the same year in 2003 and again in 2015, offering a perfect database for this discussion.

Comparing TIMSS and PISA Achievement Results on the Country Level

Despite the differences in test design discussed above, TIMSS and PISA provide similar pictures of student achievement on the country level, as can be seen when regressing PISA 2015 country-level mean scores on TIMSS 2015 grade 8 country-level mean scores in mathematics. Across the 27 systems participating in both studies, there is a close alignment between country mean scores from both studies (Fig. 1). (As this analysis focuses on the country level, we do not consider subnational entities. This rules out the United Kingdom, because only England participated in TIMSS 2015. Kazakhstan and Malaysia are not included because of problems with data quality. Norwegian TIMSS data are reported for grade 8 throughout, although Norway reported grade 9 data as well in TIMSS 2015 (Mullis et al., 2016).) The coefficient of correlation is 0.923, indicating that 85% of the between-country variation in PISA Mathematics Literacy can be explained by TIMSS, and vice versa. It is worth noting that science scores were equally well aligned on the country level in 2015: the coefficient of correlation was 0.926, accounting for 86% of between-country variance. The two-dimensional layout of Fig. 1 helps identify a pattern that would not be perceived as easily using just one of the studies: East Asian countries (including OECD members Japan and Korea) at the upper end and countries from developing regions such as the Near and Middle East (including OECD members Turkey and Chile) at the lower end form clusters with similar profiles of student achievement in TIMSS and PISA, while European OECD members, English-speaking countries, Russia, and Lithuania belong to the central cluster. This pattern would of course look different if an even more diverse set of countries implemented both TIMSS and PISA, but basically this pattern can be found in many international large-scale assessments. This includes some minor but typical deviations from the overall linear relationship: the top-achieving East Asian systems seem to do a little better in TIMSS mathematics than one would expect from their PISA results, while some Nordic and English-speaking countries (Norway, Sweden, Australia, Canada, Ireland, and New Zealand) do a little better in PISA. This is in line with the pattern that previous research found in 2003. In 2003, the TIMSS grade 8 mathematics scores and the PISA Mathematics Literacy scores were also highly correlated. Based on the 17 countries which administered both tests, the coefficient was 0.867, and the pattern of deviations from the linear regression line was much like in 2015: East Asian countries did a little better in TIMSS, and Nordic and English-speaking countries did a little better in PISA.


Fig. 1 Relationship between country mean scores for TIMSS 2015 (Mathematical Achievement) and PISA 2015 (Mathematics Literacy). The straight line illustrates the linear regression

Yet, the relationship is not perfect. As explained above, differences in sampling (grade-based vs. age-based) and differences in the facets of mathematics being covered are the most important factors distinguishing the TIMSS assessment design from the PISA assessment design. In order to take the first factor into account, Wu (2010) used the index of "mean student age" in TIMSS: the older TIMSS participants within a certain country are on average, the more similar they are to the PISA sample. In order to take the second factor into account, Wu (2010) developed an index of "content advantage" for each country. The index estimates how country results in TIMSS would change if content areas within mathematics (such as Number, Algebra, or Data and Uncertainty) had contributed the same share of items in the TIMSS test as they did in PISA. Both indices – mean student age in TIMSS, and content advantage – were used to predict country-level PISA scores. While TIMSS 2003 scores alone accounted for 71% of between-country variance in PISA 2003 mathematics scores, adding the two indices allowed Wu (2010) to explain 93% of the variance. (In addition to the 17 countries covered in our analysis (see Fig. 2), Wu included five subnational regions; therefore her figures differ slightly from ours.)


Fig. 2 Relationship between change in country mean scores between 2003 and 2015 for TIMSS (Mathematics Achievement) and PISA (Mathematics Literacy). The straight line illustrates the linear regression

The conclusion was: differences in student sampling plus differences in test content account for most of the discrepancies between TIMSS and PISA mathematics scores observed on the country level. However, the explanation of discrepancies by features of the assessment design provides no substantive insights. One factor that may be more relevant for research and policy making is opportunity to learn (OTL). OTL has been studied extensively in IEA studies and shown to be an important factor explaining differences in student outcomes (e.g., Burstein, 1993; Schmidt & Maier, 2009). The more (and deeper) content students are exposed to, the better their results in large-scale assessments. As TIMSS is focused on curricular content, it should convey information on OTL on top of the general level of mathematical competencies that is assessed in PISA. Thus, it can be expected that the small remaining discrepancies between TIMSS grade 8 and PISA mathematics scores can at least partly be explained by students' opportunity to learn mathematical content. In 2011/2012, both TIMSS and PISA (which at this time was focused on mathematics as its major domain) included measures of opportunity to learn:


• TIMSS asked teachers to judge to what extent their students had been taught core curriculum elements. Altogether, the survey covered 19 topics from the areas of Numbers, Algebra, Geometry, and Data and Chance. TIMSS 2011 reported country-level indicators for the "Percentage of Students taught the TIMSS Mathematics topics."
• PISA asked students to judge their familiarity with mathematical concepts. There was a list of 13 mathematical terms like "exponential function" or "arithmetic mean." An overall score of familiarity with mathematical concepts was developed and aggregated on the country level. (Three "foils," that is, concepts that in fact are not established in mathematics, were added to the list. These were used to correct for guessing and response bias (Kyllonen & Bertling, 2014).)
Assuming opportunity to learn is quite stable on the country level, these measures may be used to explain TIMSS 2015 results as well. Each of them makes a significant contribution. The proportion of TIMSS between-country variance accounted for increases from 85% (using PISA scores as the only predictor) to 96% (using both OTL measures on top). If we add the mean age of students participating in TIMSS 2015 as a fourth predictor – accounting for the differences in sampling – all four predictors significantly contribute to explaining country-level TIMSS scores, and overall they account for 97.4% of the variance. These analyses clearly show that TIMSS scores – although being closely related to PISA scores on the country level – carry additional information related to the quality of the mathematics curriculum implemented in classrooms.
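Once the country means and the two OTL indices have been assembled into one file, the kind of country-level regression reported here can be reproduced with an ordinary least squares fit; the file and column names below are hypothetical placeholders, not variables from any official data set.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per country; columns hold the aggregated measures described above.
df = pd.read_csv("country_level_timss_pisa_2015.csv")

# TIMSS 2015 grade 8 mathematics country means regressed on the PISA 2015
# country means, the two country-level OTL measures, and mean student age.
model = smf.ols(
    "timss_math_2015 ~ pisa_math_2015 + otl_topics_taught"
    " + otl_concept_familiarity + mean_age_timss",
    data=df,
).fit()

print(model.summary())   # the chapter reports roughly 97% explained variance
```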

Comparing Country Level Trends Based on TIMSS and PISA

In order to find out whether TIMSS and PISA provide coherent views on the change of mathematical achievement levels between 2003 and 2015, we calculated the difference between TIMSS 2015 scores and TIMSS 2003 scores for each country participating in both studies. Similarly, we calculated the change in country-level means for the 2015 and 2003 PISA Mathematics Literacy assessments. Both TIMSS and PISA change scores are available for 11 countries (see Fig. 2). The change scores correlate substantially (r = 0.612, p [. . .]

[. . .] 100 books). The model includes the following predictor variables reflecting students' access, use, and familiarity concerning ICT:
• Number of computers at home: Students reported the number of desktop and portable computers; the resulting indicator variable reflects the number of computers from 0 (no computer) up to 3 (three or more computers), and the regression coefficient indicates the increase (or decrease) in the dependent variable associated with one more computer at home.


• Experience with computers: This variable reflects how long the individual student has used computers and was coded as 0 (never or less than a year), 2 (between 1 and 3 years), 4 (at least 3 but less than 5 years), 6 (at least 5 but less than 7 years), and 8 (more than 7 years), so that the regression coefficient indicates the increase (or decrease) in the dependent variable associated with 2 more years of experience in using computers.
• Students' reports on learning of CIL-related tasks at school: The index is based on a set of eight items that required students to indicate whether they had learned about different CIL tasks at school. Values are IRT scores, which were standardized for this analysis within each country to have a mean of 0 and a standard deviation of 1.
• Students' reports on learning of CT-related tasks at school: The index is based on a set of nine items that required students to indicate whether they had learned about different CT-related tasks during the current school year. Values are IRT scores, which were standardized for this analysis within each country to have a mean of 0 and a standard deviation of 1.
The regression model (Table 4) explained 21% of the variance in students' CIL scores on average, ranging from 16% in Italy to 27% in Uruguay. For students' CT scores, the model explained 18% of the variance, ranging from 13% in Korea to 21% in France, Germany, and Luxembourg (in the German state of North Rhine-Westphalia, the model explained 23%). While female gender was positively associated with CIL in most participating countries, it was negatively associated with CT in all but one country (Denmark).
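Regressions of this kind are typically run once per plausible value of the outcome, with student weights applied, and the coefficients are then averaged; the sketch below illustrates that routine. Column names are hypothetical, and the replicate-weight variance estimation used operationally is omitted.

```python
import pandas as pd
import statsmodels.formula.api as smf

def pooled_pv_regression(students: pd.DataFrame, n_pv: int = 5) -> pd.Series:
    """Fit one weighted regression per plausible value of the outcome and
    average the coefficients across plausible values (point estimates only;
    proper standard errors also require replicate weights)."""
    estimates = []
    for pv in range(1, n_pv + 1):
        fit = smf.wls(
            f"cil_pv{pv} ~ female + test_language + books + n_computers"
            " + experience + learned_cil + learned_ct",
            data=students,
            weights=students["student_weight"],
        ).fit()
        estimates.append(fit.params)
    return pd.concat(estimates, axis=1).mean(axis=1)

# usage (one country's student file with the hypothetical columns above):
# coefficients = pooled_pv_regression(pd.read_csv("students_one_country.csv"))
```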

Table 4 Unstandardized regression coefficients for student background variables and explained variance

Country                                 Gender (female), CIL    Test language use at home, CIL
Chile                                    4.6 (4.4)               32.6 (12.9)
Denmark                                 10.2 (3.2)               20.1 (6.0)
Finland                                 17.4 (3.0)               31.3 (7.2)
France                                  17.5 (3.1)               27.0 (7.3)
Germany                                 16.1 (3.6)               33.5 (5.5)
Italy²                                   8.4 (3.6)               11.0 (4.2)
Kazakhstan¹                              5.3 (4.3)              -21.1 (11.7)
Korea, Republic of                      28.0 (3.8)               36.6 (25.8)
Luxembourg                              14.2 (2.7)                8.7 (2.7)
Portugal††¹                              3.2 (3.1)               -3.2 (8.5)
Uruguay                                  4.3 (5.2)               19.1 (17.9)
ICILS 2018 average                      11.7 (1.2)               17.8 (3.9)
Not meeting sample participation requirements
United States (r)                       14.5 (2.2)
Benchmarking participants
Moscow (Russian Federation)              3.7 (2.9)
North Rhine-Westphalia (Germany) (r)     9.9 (4.0)

* Statistically significant (p < 0.05)

[. . .] 300) on self-concept, motivation, and the relations of these with achievement. This meta-analysis suggested positive effects of self-concept and motivation on achievement, with medium effect sizes (Hattie, 2009). Aspects such as anxiety and engagement/effort also had significant relationships with performance according to effect size measures, even if average correlations were actually rather modest. Self-efficacy was not included in Hattie's meta-analysis. Stankov (2013) summarized findings from a large number of studies performed in different contexts in an effort to order noncognitive variables based on their predictability of cognitive performance. The category of variables showing very weak correlations with cognitive performance included measures of general anxiety and well-being. Self-concept, self-efficacy, and test anxiety were demonstrated to be successful predictors of academic achievement but not of intelligence. Interest and instrumental motivation were, however, also judged in this meta-analysis to be poor predictors of achievement in mathematics. In conclusion, there is a high degree of overlap between findings in ILSA contexts and findings in non-ILSA contexts.

Who Came First, the Chicken or the Egg?

Much of the above-cited research has employed analyses based on principles of regression (multilevel regression models, latent variable models), where one variable is said to predict or have an effect on another variable. Prediction in this context is, however, more a statistical measure than an indicator of a temporal causal effect.


The causal ordering of relationships between self-beliefs, motivation, and achievement is not easy to discern, even if efforts have been made over the years. The usual view taken is that relationships are reciprocal: motivation leads to higher performance, and better performance gives positive feedback and increases motivation to learn more (see, for example, Marsh & Craven, 2006). In their meta-analysis of longitudinal studies of the relation between self-beliefs and academic achievement, Valentine, DuBois, and Cooper, however, found support for a small but favorable effect of positive self-beliefs on later performance (Valentine et al., 2004), with more specific measures showing a stronger effect. In any case, relationships between noncognitive and cognitive variables are complex, and even if ILSA studies are not longitudinal or experimental by design, there are approaches that could be employed to test causal-type models of these relationships (see Rutkowski, 2016).
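One causal-type specification frequently used for this purpose in the wider motivation literature (mentioned here only as an illustration, not as a model prescribed by the studies above) is the cross-lagged panel model, which requires at least two measurement occasions:

$$
\begin{aligned}
\text{ACH}_{t+1} &= \beta_1\,\text{ACH}_{t} + \gamma_1\,\text{MOT}_{t} + \varepsilon_{1},\\
\text{MOT}_{t+1} &= \beta_2\,\text{MOT}_{t} + \gamma_2\,\text{ACH}_{t} + \varepsilon_{2},
\end{aligned}
$$

where the cross-lagged paths $\gamma_1$ and $\gamma_2$ indicate the relative support for motivation-to-achievement versus achievement-to-motivation effects; because most ILSAs are cross-sectional, such models presuppose longitudinal extensions or linked designs.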

Self-Beliefs and Motivation in Different Countries and Cultures: Can Findings Be Validly Compared?

An important line of research on self-beliefs and motivation in ILSA contexts during roughly the last decade has been the research into the cross-country comparability of the motivational measures. As evidenced by the research presented above, it is common to compare relationships between motivation and performance across different countries. But can such comparisons even be made? Cross-country comparisons have by definition always been a central feature of large-scale international studies. Cross-cultural generalizability, or the universality of motivation theory, has also been a standing issue in research on student achievement motivation (Pekrun, 2018; Tonks et al., 2018). Most motivation theories and instruments for measuring motivation have been developed in the "White, Western parts of the world," and it is not self-evident that the dynamics of motivation are the same across countries and cultures or that the operationalization and assessment of motivational variables fit well in every context. Although caution when comparing different cultures and countries on self-report items assessing noncognitive constructs has always been advised in ILSA contexts, it is only more recently that more focused research on the comparability – or invariance, in more psychometric terms – of motivational constructs has been performed, primarily using CFA/SEM frameworks (and sometimes in IRT environments). This development is likely a consequence of methodological developments, increased computational power, and the availability of software enabling statistical analyses of these issues. Also, the motivational scales currently used in the ILSAs lend themselves better to these types of analyses than the early motivational measures did. Measurement invariance is about whether items and scales seem to be "working in the same way" in different groups of students. What is usually tested is (a) configural invariance, or whether the structure of the construct as such seems comparable across groups; (b) metric invariance, or whether relationships between variables can be interpreted in the same way; and (c) scalar invariance, or whether ratings on the variable (e.g., mean scores) can be interpreted in the same way across groups.
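In factor-analytic notation, these levels correspond to increasingly strict equality constraints across groups $g$ (a generic sketch rather than the exact parameterization of any particular study):

$$x_{ig} = \nu_g + \Lambda_g\,\eta_{ig} + \varepsilon_{ig},$$

where configural invariance requires only the same pattern of loadings in $\Lambda_g$ across groups, metric invariance sets $\Lambda_g = \Lambda$ for all groups, and scalar invariance additionally sets the intercepts $\nu_g = \nu$, which is the condition that licenses comparisons of (latent) means across groups.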


Most efforts to study the cross-cultural comparability of motivational scales in ILSAs have arrived at the conclusion that the constructs as such seem to be similar across countries and cultures and also that relationships between variables seem to be comparable across countries and cultures. What is usually not appropriate, according to most published invariance studies, is to compare mean values on motivational variables (intercepts, scalar invariance) across countries and cultures (Artelt, 2005; Nagengast & Marsh, 2013). He, Barrera-Pedemonte, and Buchholz (2019) took on the relevant task of investigating the cross-study construct comparability as well as the cross-cultural comparability of noncognitive, motivational constructs in TIMSS 2015 and PISA 2015. More specifically, in terms of motivation, they investigated instrumental (extrinsic) motivation and enjoyment of science (intrinsic motivation) and noted that several constructs in the TIMSS and PISA context questionnaires show an overlap in theoretical concepts and item wording. Their invariance analyses supported configural and metric invariance. In line with other research, they also showed that mean scores on these noncognitive variables cannot be validly compared across countries, and they advise against such practice unless it can be shown that intercepts/discrimination can be proven equal across countries (He et al., 2019). Using PISA 2000 reading data, Artelt (2005) could in a similar vein show that reading interest (intrinsic motivation) and instrumental (extrinsic) motivation seem to be comparable constructs, irrespective of cultural background, while cross-country comparisons of levels of motivation would suffer from different biases, as students in different countries seem to be anchored around different norms when they respond to self-report items such as those presented in ILSA questionnaires (Artelt, 2005). Even without statistically investigating the cross-cultural comparability of ratings of motivation, one can begin to suspect that reporting of self-beliefs and motivation is not independent of context. A paradox that has been consistently evident in all ILSAs is that relationships between motivational variables and achievement tend to be positive within each country (as shown in the previous section, higher levels of self-concept tend to be associated with higher levels of performance), while they are instead negative across countries, if country averages are compared. Hence, students in high-performing countries, as a group, tend to report lower levels of self-beliefs and motivation, while students in low-performing countries, as a group, tend to report higher levels. One notable exception is the self-efficacy measure in PISA, which in the previously cited meta-study by Lee and Stankov (2018) was positively related to performance also at the between-country level (from 2006 onward this is a very specific measure, almost a cognitive one). Otherwise they found the same results that have been reported in every ILSA – positive pan-cultural and within-country correlations between motivational variables and performance – while these correlations are reversed to negative at the between-country level. Different explanations have been put forward to explain this phenomenon. Herbert Marsh has proposed a "big-fish-little-pond effect" (BFLPE). This model, based on Marsh's self-concept research, implies a negative effect of school-average achievement on academic self-concept.

46

Student Motivation and Self-Beliefs

1315

self-perceptions on relative comparisons, and being in a high-performing surrounding would thus suppress perceptions of ability. This model has gotten ample empirical support, also in the ILSA environment (cf. Nagengast & Marsh, 2011). As noted above, different response styles have also been proposed as a likely cause of bias and incomparability of self-report items (see He & Van de Vijver, 2012). Students in different cultures may have different ways of responding to Likert-type self-report items, with some cultures emphasizing modesty, others selfenhancement. Some students may be more likely to agree with items in general, etc. It should be noted that simply claiming that high-performing countries report lower levels of motivation might be overgeneralizing, but this is how the general pattern looks. However, there have been studies using smaller units of analyses that show differences also between high-performing, low-motivation countries. Using TIMSS data from grade 4 and grade 8, Ker (2017) found motivational differences between Chinese Taipei and Singapore, two high-performing Asian countries, where students in Taipei reported much lower levels of self-concept and motivation compared to students in Singapore. The younger students (grade 4) overall reported a more positive self-concept and liking math, but the same country differences were apparent also in the fourth grade. Students’ valuing of math is not assessed in the fourth grade. The take-home message from this section is that one should not compare mean levels of motivation and self-beliefs across countries unless one can provide evidence that such comparisons are justified and that invariance testing is highly recommended whenever groups are to be compared. The invariance issue is not only relevant for cross-cultural comparisons, but also comparability across, e.g., genders is important to investigate if such group comparisons on the variables of interest are going to be made.
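The reversal of the motivation–achievement relationship across levels of analysis described above can be reproduced with a few lines of simulation. The sketch below is purely illustrative (no ILSA data; all coefficients and sample sizes are assumptions): self-concept rises with a student's own achievement but falls with the country's average achievement, a BFLPE-like mechanism, which is enough to produce positive within-country correlations alongside a negative between-country correlation.

```python
# Illustrative simulation of positive within-country but negative between-country
# correlations between achievement and self-concept (all parameters are assumptions).
import numpy as np

rng = np.random.default_rng(42)
n_countries, n_students = 30, 1000
country_perf = rng.normal(500, 40, n_countries)        # country mean achievement

within_r, country_means = [], []
for mu in country_perf:
    ach = rng.normal(mu, 80, n_students)                # individual achievement
    # Self-concept depends positively on own achievement, negatively on the country mean
    self_concept = 0.010 * ach - 0.015 * mu + rng.normal(0, 0.6, n_students)
    within_r.append(np.corrcoef(ach, self_concept)[0, 1])
    country_means.append((ach.mean(), self_concept.mean()))

country_means = np.array(country_means)
between_r = np.corrcoef(country_means[:, 0], country_means[:, 1])[0, 1]
print(f"Average within-country correlation:   {np.mean(within_r):+.2f}")  # clearly positive
print(f"Correlation between country averages: {between_r:+.2f}")          # negative
```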

Other Group Differences in Levels of Motivation and Self-Beliefs

As shown above, there are consistent and significant differences between countries and cultures in terms of reported levels of motivation and self-beliefs, and the pattern is that in high-performing environments, students on average report lower levels of motivation and self-beliefs than in low-performing environments. Within environments, however, the expected pattern of positive relationships between motivation and performance is seen.

Age also seems to be a determinant of reported levels of motivation and self-beliefs. It is well documented in research that motivation tends to decline as children progress through school. Along with this, motivation also becomes increasingly differentiated, within as well as between school subjects. Also, self-perceptions become more accurate along with cognitive development and social comparisons, and young students' self-perceptions of their abilities can be somewhat detached from their actual abilities (Brown & Harris, 2013; Eccles & Wigfield, 2002; Fredricks & Eccles, 2002). Thus, research has consistently shown that younger children are more positive in terms of self-concept and liking and valuing the subject than older students. Less secondary research on these issues has been presented in the ILSA contexts, largely because ILSAs are not longitudinal studies and, with the exception of TIMSS, only measure students in one grade. Ker (2017), however, used the expectancy-value theory to study changes in motivational variables from grade 4 to grade 8 and the effects of motivational variables on math performance in TIMSS 2011 in three different countries (Chinese Taipei, Singapore, and the USA). As suggested by motivation theory and previous research, self-reported motivation in relation to academic achievement seems to deteriorate with age. Grade 8 students participating in TIMSS 2011 reported lower levels of motivation than grade 4 students in all these countries. Again, of the motivational variables, self-concept was the strongest predictor of performance, in both grades.

Another important line of research in the area of achievement motivation generally has been the research on gender differences in self-beliefs and motivation. In fact, the development of the expectancy-value theory resulted from Eccles and colleagues' efforts to explain gender differences in mathematics achievement. Gender stereotypes seem to be strong, and studies over time show that girls typically report lower levels of self-beliefs but also of motivation in relation to mathematics (cf. Else-Quest et al., 2010; Ganley & Lubienski, 2016), while the opposite is often seen for reading. Research has also consistently shown that girls tend to report higher levels of performance-related anxiety than boys. The gender differences in science self-beliefs and motivation are in the same direction as in mathematics, although less pronounced. Here, it also matters which part of science is involved; girls tend to be more confident in areas relating to health, for example (Sjöberg & Schreiner, 2010). Research in ILSA contexts has generally confirmed what previous research has shown, although findings are not as consistent as regarding the relationships between motivation and performance: patterns look slightly different across countries, over time, and across ages and subjects concerning gender gaps and levels of self-beliefs and motivation in relation to the different subjects and different age groups. Space does not allow for a closer look into these issues here, which however does not make them less important. The interested reader is referred to the international reports or the ILSA gateway (see also Ganley & Lubienski, 2016; Meece et al., 2006; Sjöberg & Schreiner, 2010).

How Is ILSA Motivation Research Situated within the Larger Motivation Research Field?

Obviously, ILSAs are not designed as motivational studies per se, and this is also not their main purpose. As a consequence, they will likely not be driving the development of grand theory in the field of achievement motivation. After all, they use a small number of items, responded to on a very restricted scale, to cover complex constructs that in most motivational theories are in turn nested in a web of other constructs intended to understand and explain human motivation. Nevertheless, ILSAs such as TIMSS, PISA, and PIRLS contain a number of relevant variables that, at least in more recent administrations, are carefully evaluated, and secondary analyses suggest they are appropriate to use in many national contexts. Actually, research using ILSA data fits rather well into the larger research field of achievement motivation. In the above review, it has already been indicated that findings from the ILSA studies seem to align well with empirical findings from non-ILSA contexts and with assumptions from motivation theory. It was also shown that researchers have made use of ILSA data to more explicitly test assumptions coming from motivation theory, e.g., self-concept theory. To some extent, the same developments that can be traced in non-ILSA motivation research are also visible in ILSA motivation research, and in some respects, ILSA contexts can be seen as an important platform for doing motivation research.

ILSAs have the advantage of large and representative high-quality samples from all parts of the world, comprehensive standardized cognitive tests, and an array of other background variables that can also be used as controls and covariates in the analyses. These are rare features in much motivation research, which historically has often used small convenience samples and unstandardized tests to explore research questions (Pekrun, 2018). ILSA-type data allow for testing partly different research questions and for performing sophisticated and data-demanding analyses. In motivation theory and motivation research, it has often been argued that Western-biased theories and instruments cannot be transferred to other cultures. Here, the possibilities of large-scale cross-cultural studies are a considerable strength of ILSA studies. Nagengast and Marsh note that "The PISA research program offers unprecedented opportunities for testing the cross-cultural generalizability of predictions from educational psychology theories" (Nagengast & Marsh, 2013, p. 337). Also Tonks, Wigfield, and Eccles, prominent motivation researchers outside ILSA, argue that "secondary analyses using international databases could yield further fruitful findings" (Tonks et al., 2018, p. 109). Reviewing developments in motivation theory and research over the last 15 years (see Liem & McInerney, 2018), methodological developments and improved research designs (interventions, experiments, high-quality observation studies, large-scale studies, etc.) as well as an increased focus on social determinants of motivation and cross-culture comparability have been highlighted as areas that have developed but are still in need of further exploration (Pekrun, 2018; Tonks et al., 2018). Although ILSA studies cannot provide longitudinal information and are not experimental, ILSA data may have great potential for exploring assumptions from theoretical models in large-scale, representative, culturally different student samples, as has already been done with both TIMSS and PISA (see Pekrun, 2018). There are also methods to approach the data so that causal hypotheses may be formulated and tested, at least tentatively (see Rutkowski, 2016).

The ILSA motivational frameworks were reviewed above, and it was argued that their theoretical grounds are sometimes not entirely clear. However, as several secondary analyses have shown, the constructs assessed match well with, for example, self-concept theory or expectancy-value theory, and these theoretical frameworks can be used, and have been used, to interpret findings. There are also studies that have used other frameworks, such as self-determination theory or control-value theory (De Naeghel et al., 2014; Pekrun, 2018). These theoretical frameworks are also visible in more recent ILSA frameworks. A key issue for the usefulness of ILSA data in future motivation research is the quality of construct definition and the transparency (and continuity!) in the operationalization and measurement of indicators of the construct. Measures should be sensitive to cultural differences but still equivalent enough to allow comparisons (Pekrun, 2018). It could also be discussed how many and how general the noncognitive factors to be assessed should be, and how priorities between quality and quantity of measures should be set. Motivation research rather consistently shows that domain-specific measures of motivation are more clearly related to achievement behavior and achievement choices than domain-general ones, but in, for example, PISA, there has been an increase in domain-general scales, which often carry less valuable interpretation in relation to achievement, even if the variables are regarded as important outcomes from an educational effectiveness perspective.

To conclude, theory and research on the dynamics of achievement motivation and its influences on achievement and future choices need more than data from ILSAs, but for testing theoretical assumptions on a large scale in a variety of contexts, ILSAs can contribute to the field, provided that the measures are good enough. Motivation theory could also more explicitly be used to support the empirical findings from ILSA secondary analyses, e.g., when it can be shown that self-beliefs and intrinsic and extrinsic motivation vary with performance and across groups according to theoretical assumptions.

Current Trends and Future Directions

Lately, several large syntheses and meta-studies of the relationships between motivation and performance and of the invariance of measures using ILSA data have been published. These are important in corroborating findings from more local studies and may provide evidence of the "relative universality" of achievement motivation. Still, aggregated general patterns can hide unique and local features, and that is why continued research within single countries, or on subgroups within countries, is still important when it comes to understanding student self-beliefs and motivation. Methodologically, much has happened since the first secondary analyses using ILSA motivation data: the measures are better, and the characteristics of the data are better acknowledged now than previously (e.g., multilevel analyses that take the nested structure of the data into account, analyses within CFA/SEM frameworks that respect the multidimensionality and complexity of the variables, and invariance studies). Considering that there are many variables possibly interacting, more secondary analyses approaching mediation/moderation issues would be welcome, as would studies trying to at least tentatively test causal hypotheses. And as already noted, carefully developed measures are key to the usefulness of ILSA data in research on motivation.

Student motivation is also an important outcome of education; this is also stressed in the educational effectiveness framework (IPO model) that underlies many ILSAs. Historically, it has been more common to study motivational variables as predictors of performance, while motivation as an outcome in its own right has received less attention. With more emphasis on the IPO model, student self-beliefs and motivation are also increasingly being studied as outcomes of different school and teacher characteristics (see, e.g., De Naeghel et al., 2014; Scherer & Nilsen, 2016), and this is likely an area that will expand. Further, students' level of motivation is often discussed in relative terms (boys are less/more motivated than girls, younger students are more motivated than older students, more motivated students perform better than less motivated students), and arriving at an answer to the simple questions "are students motivated or not?" or "how motivated is motivated enough?" is actually more difficult than it may seem, and ILSA contexts may not be the optimal arena for providing answers to questions like these. Motivation is a subjective internal state, and self-reports on a four-point Likert scale may not be very informative, in particular as different response biases may come into play. Still, ILSA data may be useful for further research on varying levels of motivation across time, countries, and age groups and for continued exploration of what role motivational variables seem to play, both as predictors and as outcomes.

Self-report scales have always been criticized for being sensitive to bias, for being difficult to evaluate in any absolute sense, etc., but they are still the dominant mode for assessing individual characteristics. As ILSAs are now going computer-based, research into the possibilities of complementing self-reported motivation with overt test-taking behavior would be welcome. Not least, task-specific motivation (test-taking motivation) is also a motivational construct of interest to the low-stakes ILSAs (see Eklöf & Knekta, 2018), and here, computer-generated process data could be important in informing about how students seem to approach these tests, an issue that has recently also been acknowledged in the PISA context (see the PISA 2018 international report).
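As a hypothetical illustration of how process data from computer-based administration could complement self-reports, the sketch below flags rapid-guessing behavior from item response times and summarizes it as a simple response-time effort index per student. The data frame, column names, and the 10%-of-median threshold are illustrative assumptions, not the procedure used in PISA or TIMSS.

```python
# Hedged sketch: flag responses faster than 10% of the item's median response time
# as rapid guesses, then compute each student's response-time effort (RTE) index.
import pandas as pd

log = pd.DataFrame({
    "student": [1, 1, 2, 2, 3, 3],
    "item":    ["M1", "M2", "M1", "M2", "M1", "M2"],
    "rt_sec":  [45.0, 60.0, 2.0, 3.5, 50.0, 55.0],   # response times in seconds (made up)
})

thresholds = 0.10 * log.groupby("item")["rt_sec"].transform("median")
log["rapid_guess"] = log["rt_sec"] < thresholds

# RTE = share of a student's responses that were NOT rapid guesses.
rte = 1.0 - log.groupby("student")["rapid_guess"].mean()
print(rte)   # student 2 stands out as a likely disengaged test-taker
```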

Concluding Remarks

Across a large number of ILSA studies and secondary analyses of ILSA data, with basically no exception, self-beliefs such as self-efficacy and self-concept show positive associations with performance, though these are rarely more than moderate in size. More specific measures tend to be more strongly related to performance than more general ones. There is also ample evidence pointing in this direction in motivation research in general. Mixed results have been reported for other motivational constructs such as intrinsic and, in particular, extrinsic motivation, but even if they are not directly related to performance, they may serve important motivational purposes in relation to other outcomes. Group differences are often investigated in motivation research, and findings from ILSA data suggest that self-beliefs and motivation seem to be relatively universal when it comes to structure and relations with other variables, but that levels of motivation cannot easily be compared across cultures.


The research on self-beliefs and motivation in ILSA contexts has improved over time as a result of improved measures and methodological advances. Looking into the future, it seems important to keep working on the measures used for assessing self-beliefs and motivation while at the same time ensuring comparability over time and across groups. With appropriate conceptualization and appropriate measures of self-beliefs and motivation, international large-scale assessments could serve important purposes in research on student motivation and self-beliefs.

References

Artelt, C. (2005). Cross-cultural approaches to measuring motivation. Educational Assessment, 10(3), 231–255.
Bandura, A. (1997). Self-efficacy: The exercise of control. W. H. Freeman and Company.
Boe, E. E., Turner, H. M., May, H., Leow, C., & Barkanic, G. (1999). The role of student attitudes and beliefs about mathematics and science learning in academic achievement: Evidence from TIMSS for six nations. CRESP data analysis report. Retrieved from http://repository.upenn.edu/gse_pubs/413
Brown, G. T. L., & Harris, L. R. (2013). Student self-assessment. In J. H. McMillan (Ed.), The SAGE handbook of research on classroom assessment (pp. 367–393). Sage.
Çiftçi, Ş. K., & Yıldız, P. (2019). The effect of self-confidence on mathematics achievement: The meta-analysis of Trends in International Mathematics and Science Study (TIMSS). International Journal of Instruction, 12, 683–694.
De Naeghel, J., Valcke, M., De Meyer, I., et al. (2014). The role of teacher behavior in adolescents' intrinsic reading motivation. Reading and Writing, 27, 1547–1565.
Deci, E. L., & Ryan, R. M. (1985). Intrinsic motivation and self-determination in human behavior. Plenum Press.
Eccles, J. S., & Wigfield, A. (2002). Motivational beliefs, values, and goals. Annual Review of Psychology, 53, 109–132.
Eklöf, H., & Knekta, E. (2018). Using large-scale educational data to test motivation theories: A synthesis of findings from Swedish studies on test-taking motivation. International Journal of Quantitative Research in Education, 4, 52–71.
Else-Quest, N. M., Hyde, J. S., & Linn, M. C. (2010). Cross-national patterns of gender differences in mathematics: A meta-analysis. Psychological Bulletin, 136, 103–127.
Fredricks, J. A., & Eccles, J. S. (2002). Children's competence and value beliefs from childhood through adolescence: Growth trajectories in two male-sex-typed domains. Developmental Psychology, 38, 519–533.
Ganley, C. M., & Lubienski, S. T. (2016). Mathematics confidence, interest, and performance: Examining gender patterns and reciprocal relations. Learning and Individual Differences, 47, 182–193.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.
He, J., Barrera-Pedemonte, F., & Buchholz, J. (2019). Cross-cultural comparability of noncognitive constructs in TIMSS and PISA. Assessment in Education: Principles, Policy & Practice, 26, 369–385.
He, J., & Van de Vijver, F. (2012). Bias and equivalence in cross-cultural research. Online Readings in Psychology and Culture, 2. https://doi.org/10.9707/2307-0919.1111
Hooper, M., Mullis, I. V. S., Martin, M. O., & Fishbein, B. (2017). TIMSS 2019 context questionnaire framework. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2019 assessment frameworks (pp. 57–78). TIMSS & PIRLS International Study Center.
Ker, H.-V. (2017). The effects of motivational constructs and engagement on mathematics achievements: A comparative study using TIMSS 2011 data of Chinese Taipei, Singapore, and the USA. Asia Pacific Journal of Education, 37, 135–149.
Kuger, S., & Klieme, E. (2016). Dimensions of context assessment. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts of learning: An international perspective. Springer International Publishing.
Lee, J. (2009). Universals and specifics of math self-concept, math self-efficacy, and math anxiety across 41 PISA 2003 participating countries. Learning and Individual Differences, 19, 355–365.
Lee, J., & Stankov, L. (2018). Non-cognitive predictors of academic achievement: Evidence from TIMSS and PISA. Learning and Individual Differences, 65, 50–64.
Liem, G. A. D., & McInerney, D. M. (Eds.). (2018). Big theories revisited 2. Information Age Publishing.
Liou, P.-Y. (2017). Profiles of adolescents' motivational beliefs in science learning and science achievement in 26 countries: Results from TIMSS 2011 data. International Journal of Educational Research, 81, 83–96.
Marsh, H. W. (2006). Self-concept theory, measurement and research into practice: The role of self-concept in educational psychology. British Psychological Society.
Marsh, H. W., & Craven, R. G. (2006). Reciprocal effects of self-concept and performance from a multidimensional perspective: Beyond seductive pleasure and unidimensional perspectives. Perspectives on Psychological Science, 1, 133–163.
McInerney, D. M., & Van Etten, S. (2004). Big theories revisited. Information Age Publishing.
Meece, J. L., Glienke, B. B., & Burg, S. (2006). Gender and motivation. Journal of School Psychology, 44, 351–373.
Michaelides, M. P., Brown, G. T. L., Eklöf, H., & Papanastasiou, E. C. (2019). Motivational profiles in TIMSS mathematics: Exploring student clusters across countries and time (IEA Research for Education, Vol. 7). Springer Open.
Mullis, I. V. S., Martin, M. O., Smith, T. A., Garden, R. A., Gregory, K. D., Gonzalez, E. J., Chrostowski, S. J., & O'Connor, K. M. (2001). TIMSS assessment frameworks and specifications 2003. Chestnut Hill.
Murphy, P. K., & Alexander, P. A. (2000). A motivated look at motivational terminology. Contemporary Educational Psychology, 25, 3–53.
Nagengast, B., & Marsh, H. W. (2011). The negative effect of school-average ability on science self-concept in the UK, the UK countries and the world: The big-fish-little-pond effect for PISA 2006. Educational Psychology, 31, 629–656.
Nagengast, B., & Marsh, H. W. (2013). Motivation and engagement in science around the globe: Testing measurement invariance with multigroup structural equation models across 57 countries using PISA 2006. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.), Handbook of international large-scale assessment: Background, technical issues, and methods of data analysis (pp. 318–344). Chapman and Hall.
Nagengast, B., Marsh, H. W., Scalas, L. F., Xu, M. K., Hau, K.-T., & Trautwein, U. (2011). Who took the "x" out of the expectancy-value theory? A psychological mystery, a substantive-methodological synergy and a cross-national generalization. Psychological Science, 22, 1058–1066.
OECD. (2003). PISA 2003 assessment framework: Mathematics, reading, science and problem solving knowledge and skills. OECD Publishing.
OECD. (2019). PISA 2018 assessment and analytical framework. OECD Publishing.
Oskarsson, M., Kjaernsli, M., Sørensen, H., & Eklöf, H. (2018). Nordic students' interest and self-belief in science. In Northern lights on TIMSS and PISA 2018. Nordic Council of Ministers.
Papanastasiou, E. C., & Zembylas, M. (2004). Differential effects of science attitudes and science achievement in Australia, Cyprus, and the USA. International Journal of Science Education, 26, 259–280.


Park, Y. (2011). How motivational constructs interact to predict elementary students' reading performance: Examples from attitudes and self-concept in reading. Learning and Individual Differences, 21, 347–358.
Pekrun, R. (2018). Control-value theory: A social-cognitive approach to achievement emotions. In G. A. D. Liem & D. M. McInerney (Eds.), Big theories revisited 2. Information Age Publishing.
Pintrich, P. R. (2003). Motivation and classroom learning. In W. M. Reynolds & G. E. Miller (Eds.), Handbook of psychology: Educational psychology (Vol. 7, pp. 103–122). John Wiley & Sons.
Rutkowski, L. (Ed.). (2016). Special issue on causal inferences with cross-sectional large-scale assessment data. Large-scale Assessments in Education, 4(8). Retrieved from https://www.springeropen.com/collections/LsAE
Ryan, R. M., & Deci, E. L. (2000). Intrinsic and extrinsic motivation: Classic definitions and new directions. Contemporary Educational Psychology, 25, 54–67.
Scheerens, J. (1990). School effectiveness and the development of process indicators of school functioning. School Effectiveness and School Improvement, 1, 61–80.
Scherer, R., & Nilsen, T. (2016). The relations among school climate, instructional quality, and achievement motivation in mathematics. In T. Nilsen & J.-E. Gustafsson (Eds.), Teacher quality, instructional quality and student outcomes (IEA Research for Education 2). Springer Open.
Schiefele, U., Schaffner, E., Möller, J., & Wigfield, A. (2012). Dimensions of reading motivation and their relation to reading behavior and competence. Reading Research Quarterly, 47, 427–463.
Schunk, D. H., Pintrich, P. R., & Meece, J. L. (2010). Motivation in education: Theory, research, and applications (3rd ed.). Pearson Education.
Sjöberg, S., & Schreiner, C. (2010). The ROSE project: An overview and key findings. Oslo University. Retrieved from http://www.cemf.ca/%5C/PDFs/SjobergSchreinerOverview2010.pdf
Stankov, L. (2013). Noncognitive predictors of intelligence and academic achievement: An important role of confidence. Personality and Individual Differences, 55, 727–732.
Tonks, S. M., Wigfield, A., & Eccles, J. S. (2018). Expectancy-value theory in cross-cultural perspective: What have we learned in the last 15 years? In G. A. D. Liem & D. M. McInerney (Eds.), Big theories revisited 2. Information Age Publishing.
Valentine, J. C., DuBois, D. L., & Cooper, H. (2004). The relation between self-beliefs and academic achievement: A meta-analytic review. Educational Psychologist, 39, 111–133.
Wentzel, K. R., & Wigfield, A. (2009). Handbook of motivation in school. Routledge.
Wigfield, A., & Eccles, J. S. (2002). Development of achievement motivation. Academic Press.
Yang, G., Badri, M., Al Rashedi, A., & Almazroui, K. M. (2018). The role of reading motivation, self-efficacy, and home influence in students' literacy achievement: A preliminary examination of fourth graders in Abu Dhabi. Large-Scale Assessments in Education, 6, 1–19.

47 Well-Being in International Large-Scale Assessments

Francesca Borgonovi

Contents
Introduction
What Is Child Well-being?
Criticism of Measuring Well-being in ILSAs
Including Well-being in ILSAs
Cognitive Dimension
Psychological Dimension
Social Dimension
Material: Economic Dimension
Physical Dimension
What Can Be Learnt from ILSAs About Children's Well-being?
What Can Be Learnt from ILSAs About Adults' Well-being?
Conclusions and Implications
References


Abstract

Over the past 25 years, International Large-Scale Assessments (ILSAs) have become important elements in support of the development of educational policy and reform. League tables of students' achievement have sparked intense debates among educators, researchers, and policy-makers over the quality of specific education systems, often in response to intense media scrutiny. However, until recently, the dialogue arising from ILSAs ignored aspects beyond academic achievement, such as students' social, psychological, physical, and material well-being. Lack of coverage in ILSAs of broader aspects of students' well-being has become one of the reasons why ILSAs have been criticized by many teachers, school leaders, education professionals, students, and their families, as well as policy-makers. As a result, many ILSAs have progressively started to include instruments designed to assess students' well-being, and indicators have been developed to assist the accumulation of a solid knowledge base on factors that are associated with the development of different dimensions of students' well-being. Five dimensions of well-being are considered: cognitive, social, psychological, physical, and material. The chapter first maps which well-being indicators are available in ILSAs, how inclusion evolved over time, and how inclusion differs across assessments. The chapter then reviews findings from academic and policy-oriented research based on well-being data in ILSAs, with a focus on evidence that considers the interaction between the cognitive dimension of well-being and the other dimensions (social, psychological, physical, and material dimensions), detailing the value added for research and policy.

Keywords

Well-being · Social · Psychological · Material · Physical · Sense of belonging · Life satisfaction · PISA · TIMSS · PIRLS · ILSA

Introduction

Over the past 25 years, International Large-Scale Assessments (ILSAs) have become important elements in support of the development of education policy and education reform (Egelund, 2008; Ertl, 2006; Grek, 2009; Takayama, 2008). League tables of students' achievement have sparked intense debate among educators, researchers, and policy-makers over the quality of different education systems. School-level ILSAs have been used to benchmark progress, set standards, and foster policy learning across different education systems (Breakspear, 2012). Similarly, assessments of adult populations have been used as a source of evidence on the extent to which education and training systems promote good labor market outcomes for individuals and societies through effective transitions from school to work, vocational education, and workplace learning (Martin, 2018).

Just as ILSAs have gained in visibility and use, they have also started to attract an increasing level of criticism, a criticism that has been especially directed towards school-based assessments in general, and the Program for International Student Assessment (PISA) in particular. Many critics lament that, by focusing effort and attention on achievement measures in specific academic domains, ILSAs fail to portray the extent to which education systems promote the wide range of skills and competences children need to master to be successful in their daily lives and in their future (Meyer et al., 2014). Even more fundamentally, critics argue that ILSAs change the way in which education systems operate by redirecting time, effort, and resources to the narrow maximization of what is being measured in such assessments (Auld & Morris, 2016; Labaree, 2014). Critics often concentrate their arguments on PISA because the Organization for Economic Cooperation and Development (OECD), which is responsible for promoting PISA, is not only involved in developing the assessment but also in providing policy advice to education policy-makers on education reforms (Sellar & Lingard, 2013; Volante & Ritzen, 2016; Zhao, 2020). Therefore, the OECD is often considered as especially powerful in promoting a specific vision of educational success.

Longitudinal studies suggest that students' results on the PISA test, and tests like PISA, are correlated with how well students will do later on in life in terms of educational attainment and labor market participation (OECD, 2010, 2012a; Borgonovi, Ferrara & Piacentini, 2021), as well as with their empirical or self-reported grades (OECD, 2010, 2012b). Yet, in support of the fact that students' results on standardized tests do not portray a complete picture of their academic achievement or their school systems' quality, strong performance in standardized assessments has been found to explain only so much of how well students will do in terms of educational attainment or early labor market entry (OECD, 2018a; Stankov, 1999; Sternberg, 1995). Employment and full participation in society require much more than just proficiency in the information processing abilities typically measured in ILSAs (Levin, 2012). Furthermore, there is little evidence to date on the extent to which students who have higher achievement in ILSAs during childhood and adolescence enjoy better health and well-being outcomes as adults, although longitudinal follow-ups of PISA participants that were conducted in Denmark, Canada, Australia, Switzerland, Uruguay, Singapore, and the USA could be used to that effect (see Cardozo, 2009; LSAY, 2014; Maehler & Konradt, 2020; OECD, 2010; Ríos González, 2014; Rosdahl, 2014; Scharenberg et al., 2014).

The interest in the extent to which achievement measures contained in ILSAs can be considered reliable proxies of the broad well-being of individuals and, if not, the extent to which ILSAs should incorporate other well-being measures and what these should be, mirrors discussions among economists on the value of Gross Domestic Product (GDP). Traditional economic measures such as GDP were considered to be good indirect proxies of well-being rather than end goals in themselves. However, in recent years, researchers and policy-makers have begun to promote the adoption of direct measures of well-being when assessing the efficiency of different policy interventions (see CAE, 2011, also known as the final report of the Stiglitz-Sen-Fitoussi Commission on the Measurement of Economic Performance and Social Progress), because it has been observed that GDP or measures of economic resources alone do not adequately represent other aspects of human well-being and behavior. In this context, evidence from ILSAs has been used in international benchmarking efforts conducted by the Organization for Economic Cooperation and Development (OECD), the United Nations Children's Fund (UNICEF), and the United Nations Educational, Scientific and Cultural Organization (UNESCO) to derive international comparisons of direct well-being measures (Adamson, 2013; OECD, 2020; UNESCO, 2016). Such studies mainly relied on aggregate country-level indicators and were based on multiple sources of data. Data on achievement from ILSAs such as the OECD's PISA and Program for International Assessment of Adult Competences (PIAAC) studies, as well as the Progress in International Reading Literacy Study (PIRLS) and the Trends in International Mathematics and Science Study (TIMSS), were used in these studies in conjunction with other data sources on individuals' physical and mental health, self-reported happiness and satisfaction with life, poverty, the quality of social relationships, and civic engagement.

What critics of standard economic approaches focused on maximizing individual measures such as GDP and critics of achievement measures in ILSAs have in common is the recognition that individual well-being is multifaceted, that trade-offs can exist between maximizing different dimensions of well-being, and that no individual indicator can capture the complexity of human experiences (CAE, 2011). Although the media reporting and policy dialog that arise from the analysis of results from ILSAs often continue to focus on academic achievement as the only aim of education systems, there is a growing recognition among experts involved in the development of ILSAs that well-being represents an equally important goal (Levin, 2012; UNESCO, 2016). As a result, in recent years innovations have been promoted in the context of ILSAs affecting not only the choice and definition of the competency domains covered in the tests, but also the inclusion of measures aimed at capturing the broader well-being of participants.

This chapter first defines how the well-being of children can be conceptualized and develops operational definitions of well-being applicable to the population covered in school-based ILSAs. The chapter then reviews the availability of indicators of well-being in ILSAs covering children and covering adults, with the aim of considering what lessons and indicators appear particularly promising for transferability from adult settings to school-based assessments. The chapter then reviews research on well-being indicators based on ILSAs and considers implications for research, policy, and practice.

The chapter focuses on large-scale assessments that cover more than one country and one world region (i.e., are international), that contain performance assessments (i.e., they contain a test component designed to assess ability in a domain), and that are large-scale (i.e., aim to be representative of broad populations with precision and therefore have large samples). The primary focus is on school-based assessments (i.e., assessments of students in schools, irrespective of whether the target population is defined in terms of age or grade attended). School-based assessments constitute the majority of ILSAs involving children. As such, these assessments typically do not consider some of the most disadvantaged children, that is, children who have already left school at the relevant age/grade.

What Is Child Well-being?

Two perspectives have been developed to characterize child well-being. The first is the developmental perspective, which underscores the importance of promoting child well-being as a way to promote good outcomes for the adults of the future (Bronfenbrenner, 1979). According to the developmental perspective, promoting the well-being of children is important because it is the means through which it is possible to promote the well-being of adult populations. The second perspective is the children's rights perspective, which recognizes children as human beings and maintains that focusing on the well-being of children is inherently important because their struggles, suffering, happiness, and success are just as important as those of adult populations (Ben-Arieh, 2010; Casas, 1997). The key difference between the two perspectives is that, while the developmental perspective is instrumental (i.e., children are valued as prospective adults), the children's rights perspective considers child well-being to be intrinsically valuable.

In the context of ILSAs, the debate over the instrumental and the intrinsic value of measuring well-being pertains not only to the time perspective (valuing the present child vs. valuing the future adult), but also to why well-being matters. Among education specialists, the rationale that is typically presented for including well-being measures is that well-being promotes academic achievement while its absence reduces the likelihood that children will develop strong skills and competences (Basch, 2011; Burns & Rapee, 2006; Ferguson & Power, 2014; Ontario Ministry of Health Promotion, 2010; Public Health England, 2014; Vessey & McGowan, 2006). By contrast, policy analysts and researchers from other fields, such as public health, labor economics, and social policy, highlight the role played by the academic achievement of children in shaping the broad well-being of adults, for example, by promoting good physical and mental health, social connectedness, a good income, and stable employment prospects (Gordon, 2005; Layard, Clark, Cornaglia, Powdthavee, & Vernoit, 2014; Schonert-Reichl, Smith, Zaidman-Zai, & Hertzman, 2012). Finally, those who embrace the new well-being agenda consider different aspects of well-being to be important, recognizing that different individuals or different institutions might assign a greater value to different well-being dimensions.

In this chapter, well-being is described as follows (Borgonovi & Pál, 2016, p. 8):

(...) a dynamic state characterised by students experiencing the ability and opportunity to fulfil their personal and social goals. It encompasses multiple dimensions of students' lives, including: cognitive, psychological, physical, social and material. It can be measured through subjective and objective indicators of competencies, perceptions, expectations and life conditions.

This definition emphasizes the multidimensionality of students' well-being, which encompasses both students' states and outcomes at a specific age group, as well as developmental processes that may act as risk or protective factors shaping well-being in later life. Figure 1 describes the five dimensions of well-being considered in this chapter.

Fig. 1 Well-being dimensions in ILSAs: cognitive, psychological, social, physical, and material

Criticism of Measuring Well-being in ILSAs

Even among researchers, policy-makers, and test developers who are convinced of the importance of measuring different aspects of well-being in the context of ILSAs, two criticisms have been raised. The first is that the introduction of well-being measures will increase the burden of participating in ILSAs for respondents and, as a result, might compromise the overall quality of collected data (Martin, Rust, & Adams, 1999; Mislevy, Beaton, Kaplan, & Sheehan, 1992; OECD, 2013a, p. 191). Increases in response burden may take two forms. First, assuming that well-being measures will be additional to, rather than substitute for, other measures, they run the risk of increasing the time required to participate in a study. Given the limited attention span of children and survey participants in general and the fact that the tests are already long and cognitively demanding, fatigue effects are a concern, and there is unwillingness to increase the length of the questionnaires. Therefore, making questionnaire space for well-being indicators comes at the expense of other indicators and/or methodological innovations. For example, electronic delivery of questionnaires and assessments and statistical analyses have been used to create efficient sequencing of material and rotation designs that make it possible to extend questionnaires without increasing the burden for individual respondents (Adams, Lietz, & Berezner, 2013; Kaplan & Su, 2018; von Davier, 2014). Second, some of the questions that are typically implemented in well-being modules, such as height and weight or feelings of happiness or unhappiness, may be perceived as sensitive or intrusive by participants and therefore might lower their motivation to take part in the study (reducing participation rates) or reduce the effort they invest while participating (increasing item non-response and/or careless responding) (Hopfenbeck & Kjærnsli, 2016).

The second criticism pertains to the validity, reliability, and comparability of well-being indicators given the limited amount of time that can be devoted to their measurement in the context of ILSAs. The unique feature that distinguishes ILSAs from other surveys is that they provide indicators of information processing abilities by administering achievement tests to participants. In order to guarantee the validity, reliability, and comparability of the achievement domains, ILSAs devote most of the survey time to the administration of the tests, while the background questionnaires receive considerably less time. As a result, only a limited set of constructs can be administered during the questionnaire, and only a few questions can be administered to capture each construct.
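As an aside on the rotation designs mentioned above, the sketch below shows, in deliberately simplified form, how questionnaire modules can be rotated across booklet forms so that total content grows without lengthening any individual questionnaire. The module labels, the number of modules, and the cyclic assignment rule are illustrative assumptions and do not reproduce the actual PISA or TIMSS rotation schemes.

```python
# Simplified cyclic rotation: each form carries k of the m questionnaire modules,
# so each module reaches roughly k/m of the sample while individual burden stays at k.
modules = ["A", "B", "C", "D", "E", "F"]   # hypothetical questionnaire modules
m, k = len(modules), 3                      # modules per booklet form

forms = [[modules[(i + j) % m] for j in range(k)] for i in range(m)]
for i, form in enumerate(forms, start=1):
    print(f"Form {i}: {form}")

# With forms assigned to students at random, every module is answered by about
# half of the sample (k/m = 3/6), while each student's questionnaire stays at k
# modules. Real designs additionally balance which modules appear together so
# that covariances between modules can be estimated.
```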


Including Well-being in ILSAs

Cognitive Dimension

The cognitive dimension of well-being covered in ILSAs identifies both the levels of subject-specific skills and knowledge students have acquired, as well as their self-beliefs in, and dispositions towards, those subjects (Mullis & Martin, 2013, 2015; OECD, 2017). ILSAs can be categorized into studies designed to capture curricular aspects of students' learning (for example, TIMSS) and studies that are designed to capture how students apply subject knowledge to real-life problems and situations (for example, PISA). ILSAs can also be classified according to the degree of specificity/comprehensiveness of the assessment and how much domain coverage has evolved over time. Most ILSAs are specific and stable, that is, only one subject/competence domain is assessed, and the same domain is assessed in different cycles. For example, as illustrated in Table 1, PIRLS only assesses reading literacy, the International Civic and Citizenship Study (ICCS) only assesses civic and citizenship, and the International Computer and Information Literacy Study (ICILS) assesses computer and information literacy. The coverage of TIMSS is broader since it assesses both mathematics and science. PISA, by contrast, measures reading literacy, mathematical literacy, and science literacy in all cycles, as well as additional domains which have progressively broadened the scope of the assessment over the years. Additional assessment domains covered in PISA so far are: Problem Solving (in 2003 and 2012), Financial Literacy (2012, 2015, and 2018), Collaborative Problem Solving (2015), and Global Competence (2018).

Table 1 Indicators of the cognitive well-being dimension in ILSAs
  Reading
    PISA: 2000 (main); 2003; 2006; 2009 (main); 2012; 2015; 2018 (main). Digital reading 2009 and 2012 (optional); digital main instrument since 2015.
    PIRLS: 2001; 2006; 2011; 2016. Digital reading 2016.
  Mathematics
    TIMSS: grades 4 and 8, 1995; 1999; 2003; 2007; 2011; 2015; 2019.
    PISA: 2000; 2003 (main); 2006; 2009; 2012 (main); 2015; 2018. Digital mathematics 2012 (optional); digital main instruments since 2015.
  Science
    TIMSS: grades 4 and 8, 1995; 1999; 2003; 2007; 2011; 2015; 2019.
    PISA: 2000; 2003; 2006 (main); 2009; 2012; 2015 (main); 2018. Digital instruments since 2015.
  Problem solving
    PISA: 2003; 2012 (digital).
  Financial literacy
    PISA: 2012; 2015; 2018. Digital instruments since 2015.
  Subject-specific self-beliefs and motivation
    TIMSS: all editions.
    PISA: all editions (matching the main domain tested).
    PIRLS: all editions.
  Note: For PISA, years marked (main) indicate when the domain was the main assessment domain.

In addition to the cognitive domains, ILSAs measure students' self-beliefs, which comprise a range of students' attitudes and dispositions towards learning, such as subject-specific motivation and self-efficacy, using questionnaires. Typically, in the context of ILSAs, self-beliefs and motivation constructs include subject-specific self-efficacy, self-concept, and the perceived value of the subject domain for intrinsic and extrinsic purposes. In 2018, PISA departed from the established practice of examining subject-specific motivational and efficacy beliefs in ILSAs by including indicators of students' general feelings of competence as well as their fear of failure and beliefs over the malleability of intelligence (Dweck, 2006).

Psychological Dimension

The psychological dimension of students' well-being describes students' self-reported psychological functioning and covers life satisfaction – students' self-evaluations about their lives – as well as the goals and ambitions they have for their future (OECD, 2013b). Life satisfaction refers to evaluations made by individuals over their perceived quality of life overall (Shin & Johnson, 1978). Crucially, the criteria for defining satisfaction with life are based on personal standards (Neto, 1993), rather than upon criteria pre-defined by individuals other than the respondent (Diener, 1984; Neto, 1993). In the context of ILSAs, the introduction of questions on life satisfaction allows children to express their voice and judge their own lives based upon their own standards and opinions. Among teenagers, high levels of life satisfaction are associated with positive physical and cognitive development and with social and coping skills that lead to more positive outcomes in adulthood (Currie et al., 2012).

As illustrated in Table 2, PISA first integrated a measure of life satisfaction in the main student questionnaire in its 2015 edition [Overall, how satisfied are you with your life as a whole these days? Slidebar with 0–10 (not at all satisfied, completely satisfied)] and the question was maintained in 2018. In the optional well-being questionnaire that was administered in PISA 2018, participants were asked to indicate satisfaction with specific domains of their lives, including physical aspects, social aspects, the neighborhood in which they live, the relationships they have with their parents and teachers, and their life at school. No comparable measure was administered in TIMSS (grade 4 or 8), PIRLS, ICILS, or ICCS.

Table 2 Indicators of the psychological well-being dimension in ILSAs
  Life satisfaction
    PISA: 2015; 2018.
  Satisfaction with specific life domains
    PISA: 2018 (optional well-being questionnaire).
  Eudaemonia
    PISA: 2018 (general).
  Affect
    PISA: 2018 (general).
    TIMSS: 2015 (grades 4 and 8): pride; 2003 onwards (grades 4 and 8): liking being in school / feeling safe; 1999 (grade 8): importance of having fun (self, friends, mother).
    PIRLS: 2016: pride; all editions: liking being in school / feeling safe.

In 2018, PISA complemented the broad measure of life satisfaction with more specific questions aimed at identifying students' reported meaning and purpose in life, also referred to as eudaemonia (OECD, 2013b). PISA introduced indicators of eudaemonia in the main student questionnaire in 2018; no other PISA cycle and no other school-level ILSA included indicators of eudaemonia. Students were asked to report how much they agreed or disagreed that their life has a clear meaning or purpose, that they have discovered a satisfactory meaning in life, and that they have a clear sense of what gives meaning to their life.

The final component of subjective well-being is affect, that is, the extent to which students experience certain emotions and moods, usually at a particular point in time (Watson, Clark, & Tellegen, 1988). Together with life satisfaction and eudaemonia, general affect is one of the three measures of subjective well-being included in the PISA 2018 student questionnaire. Affect dimensions covered in PISA 2018 include feeling happy, scared, lively, miserable, proud, afraid, joyful, sad, and cheerful. In the context of PISA 2018, students were asked to report how frequently ("never," "rarely," "sometimes," and "always") they feel happy, lively, proud, joyful, cheerful, scared, miserable, afraid, and sad. Although TIMSS and PIRLS do not include domain-general questions on affect, they include indicators of affect directed towards school in general or their school in particular. For example, both TIMSS and PIRLS include indicators of the extent to which participating students feel proud to go to their school (TIMSS 2015, both grades, and PIRLS 2016), like being in school, and feel safe when they are at school (all editions). Furthermore, in 1999, TIMSS included questions on the importance of having time to have fun, using a 4-point Likert scale ranging from "strongly agree" to "strongly disagree." Participating students were asked to report not only their own views on the importance of having fun, but also the views of their mother and of most of their friends.

Social Dimension The social dimension of well-being has been captured in ILSAs through indicators of sense of belonging to the school community, friendship networks, bullying behavior, and relationships with teachers. Indicators of social well-being typically captured in ILSAs have been fundamental to the development of research on school climate (see

1332

F. Borgonovi

Table 3 Social dimension indicators in ILSAs Sense of belonging Relationships with teachers

Bullying Social connectedness— Friends Support if psychological distressed

PISA 2000, 2003, 2012, 2015, and 2018 2000, 2003, 2009, and 2012 (fairness) 2015 victimization 2018 teacher support and encouragement 2015 and 2018 comparable, limited to 3 items 2015; 2018 (optional wellbeing questionnaire)

TIMSS 2003, 2007, 2011, and 2015 2015 teacher fairness

PIRLS 2006 and 2016 2016 teacher fairness

Add editions, both grades 1995; 1999; 2003; 2007

All editions

2018 (optional well-being questionnaire)

Wang & Degol, 2016 for a comprehensive review). Such research is anchored in a bio-ecological framework of human development (Bronfenbrenner, 1979) that considers children’s development to be shaped by their social context and environment. Individual responses on social well-being are considered both as outcomes in their own right and are aggregated to construct indicators of school climate. TIMSS contained a question on the amount of time children spent playing or talking with friends outside of school in 1995, 1999, 2003, and 2007, and PISA contained a similar question in its 2015 and 2018 edition. No comparable information is available for PIRLS, in TIMSS after 2007 rounds and early PISA rounds. Collaboration with others for school purposes and discussion about reading can signal social integration and social well-being. TIMSS and PISA also contain a series of questions on the extent to which students collaborate with others either in class or after school to work on science and math topics, and PISA contains questions on the extent to which the student discusses what he or she is reading with friends or family. Lack of social well-being has been captured in most TIMSS and PIRLS editions through indicators of bullying behavior, while PISA introduced questions on bullying only in its 2015 and 2018 editions. Conducting trend analyses or comparing responses across the three surveys is therefore difficult because of differences in question wording and framing across studies and across editions within each study. Bullying is a specific type of aggressive behavior that involves unwanted, negative actions in which someone intentionally and repeatedly harms and discomforts another person who has difficulty defending himself or herself (Olweus, 1993). Bullying involves the exercise of power on the part of the bully towards the victim (Woods & Wolke, 2004) and can take a physical (hitting, punching, and kicking), verbal (name-calling and mocking) and relational form (spreading gossip and

47

Well-Being in International Large-Scale Assessments

1333

engaging in other forms of public humiliation, shaming, and social exclusion; Woods & Wolke, 2004). Cyberbullying, harassment that takes place through digital devices and tools, is also a form of bullying (Smith et al., 2008). The early editions of the TIMSS and PIRLS questionnaires probed students to report physical forms of bullying, while in more recent editions, these were complemented with questions on verbal and relational bullying as well as bullying behavior occurring online and social media. PISA included questions on all forms of bullying in 2015 and 2018, although it made explicit the possibility of bullying behavior occurring online only in the 2018 edition. Another important indicator of social well-being is sense of belonging. Sense of belonging to the school community represents the extent to which students feel accepted, respected, included, and supported by their school community (i.e., peers, teachers, and other adults) (Goodenow & Grady, 1993). It is a key dimension of emotional engagement with the school (Fredricks, Blumenfeld, & Paris, 2004) and is positively associated with academic outcomes and overall well-being (Fredricks et al., 2004; Sánchez, Colón, & Esparza, 2005). Among students, sense of belonging is characterized by a need for regular contact with peers, by a perception of an ongoing, affective relationship with other students and teachers (Baumeister & Leary, 1995). Lack of social connectedness may lead not only to social isolation and feelings of loneliness, with a negative effect on students’ overall well-being, but also to alienation from the education system, the values it proposes and the objectives it pursues. It may lead to dropping out from the education system, lower levels of motivation, persistence, and self-regulation, particularly among disadvantaged student populations (Becker & Luthar, 2002). In PISA, sense of belonging to the school community in general was assessed in 2000, 2003, 2012, 2015, and 2018 using a 4-point Likert scale ranging from “strongly disagree” to “strongly agree” on the extent to which students felt that their school was a place where: “I feel like an outsider (or left out of things);” “I make friends easily;” “I feel like I belong;” “I feel awkward and out of place;” “other students seem to like me;” “I feel lonely.” TIMSS included questions on the extent to which students reported liking being in school in its 2003, 2007, 2011, and 2015 edition while it also included questions on feeling safe while at school and feeling like they belong at school in its 2011 and 2015 editions. PIRLS included questions on the extent to which students reported liking being in school, feeling safe in 2006 and 2016. In 2016, PIRLS also included questions on the extent to which students reported feeling like they belong in school and are proud of their school in 2016, while in 2006, it included questions on the extent to which students showed respect towards each other, care about each other and help each other with homework. In the optional well-being questionnaire that was administered in 2018, students participating in the PISA study were asked to report the number of close friends they had and the number of days during the week in which they spent time with their friends, the frequency of contact with their friends through text messages and social media and the extent to which their friends were well accepted by their parents. 
Students were also asked to report if they had support from someone close in case something was bothering them and if they had positive support from their parents.
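To illustrate how responses to Likert items of the kind described above are often condensed into a single scale score for analysis, the short Python sketch below reverse-codes the negatively worded sense-of-belonging items and averages the six responses. This is a minimal illustration under assumed item labels and response coding; operational PISA indices are derived with more sophisticated, IRT-based scaling, although the reverse-coding step shown here is part of any such scoring.

# Minimal sketch (Python): scoring the six sense-of-belonging items.
# Item labels and data are illustrative, not the operational PISA variable names.

NEGATIVE_ITEMS = {"outsider", "awkward", "lonely"}  # negatively worded statements

def belonging_score(responses):
    """Average the six 4-point Likert responses (1 = strongly disagree,
    4 = strongly agree), reverse-coding negatively worded items so that
    higher values always indicate a stronger sense of belonging."""
    recoded = []
    for item, value in responses.items():
        recoded.append(5 - value if item in NEGATIVE_ITEMS else value)
    return sum(recoded) / len(recoded)

# Example: a student who endorses the positive items and rejects the negative ones.
student = {"outsider": 2, "make_friends": 3, "belong": 4,
           "awkward": 1, "liked": 3, "lonely": 1}
print(belonging_score(student))  # 3.5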


Finally, information was sought in the three studies on students’ relationships with their teachers. PIRLS in 2016 included questions on the extent to which students felt that their teachers were fair to them. Teacher fairness was also assessed in PISA in 2000, 2003, 2009, and 2012 and in TIMSS 2015. In 2000, 2003, 2009, and 2012, PISA students were asked to report, using a 4-point Likert scale, how much they disagreed or agreed that most teachers were interested in students’ well-being, that most of their teachers really listened to what they had to say, and that they would receive extra help from their teachers if needed. In 2015, students were also asked to report how victimized they felt by their teachers by indicating how frequently, in the year prior to the interview, their teachers called on them less often than they called on other students, graded them harder than they graded other students, gave them the impression that they thought they were less smart than they really were, disciplined them more harshly than other students, ridiculed them in front of others, or said something insulting to them in front of others. In 2018, several questions reflected students’ perceptions of their teachers’ interest in them and characterized their relationship with their teachers. In particular, a number of questions were administered to gauge teachers’ beliefs about and encouragement of their students. For example, students were asked to report how much they agreed that their teacher made them feel confident in their ability to do well in the course, listened to their views on how to do things, understood them, liked teaching them, was enthusiastic, showed enjoyment in teaching, posed questions that motivated students to participate actively, and encouraged students to express their opinion about a text.

Material: Economic Dimension

Socio-economic disparities in academic achievement have attracted the attention of researchers and policy-makers since the 1960s (see, for example, Coleman et al., 1966, and comprehensive reviews such as White, 1982; McLoyd, 1998; Buchmann, 2002; Sirin, 2005). Given the policy relevance of examining how socio-economic status (SES) relates to educational attainment and achievement among school-aged children, all ILSAs include measures of SES, although no consensus has emerged on the conceptual meaning of socio-economic status or on how best it can be measured through children’s reports, especially in PIRLS and TIMSS grade 4, when respondents are very young. For example, in studies where the number of books at home (PIRLS) or parental educational attainment and occupational status (PISA) were reported by participating students as well as by their parents, there was a discrepancy for a sizeable number of students (Rutkowski & Rutkowski, 2010). Countries with the lowest agreement for the number of books also tend to be lower achievers and less economically wealthy than those with higher correlations (Rutkowski & Rutkowski, 2013), suggesting that accuracy in reporting could differ across groups, thus influencing comparisons. Different ILSAs include different variables, or different combinations of variables, to describe social class, poverty and affluence, or a student’s or a student’s family’s ranking on the social ladder (see Table 4 for a review).

Table 4 Indicators of the material well-being dimension in ILSAs

Number of books at home: PISA, all editions; TIMSS, all editions (all grades); PIRLS, all editions
Parental educational attainment: PISA, all editions; TIMSS, all editions (grade 8)
Parental occupational status: PISA, all editions
Household resources: PISA, all editions; TIMSS, all editions; PIRLS, all editions
Paid job: PISA, 2015 and 2018; TIMSS, 1999 and following
Job at home: PISA, 2015 and 2018; TIMSS, 1999 and following
Born in country of the test/parents born in country of the test: PISA, all editions (in some countries, information on the exact country of birth is asked for large immigrant groups), plus age at arrival; TIMSS, all editions (plus age at arrival); PIRLS, all editions (plus age at arrival)
Language spoken at home: PISA, all editions (in some countries, information on the exact language spoken is asked for large language minority groups), and in 2018 the frequency of use of different languages with different individuals; TIMSS, grade 8
Worry about family finances: PISA, 2018

In general, all ILSAs recognize that students’ material well-being should capture not only the lack of deprivation in an absolute sense, but also reflect relative deprivation, i.e., how well off a student is compared to other students. One of the consequences of the diverse conceptualization and measurement approaches to SES in ILSAs is that empirical estimates of socio-economic disparities in academic achievement can vary greatly across studies. As highlighted in Table 4, the only indicator that is available across all studies is the number of books present in the student’s home. In TIMSS since 1995 and PIRLS since 2001, grade-4 students were asked to report the number of books available in their home as well as a range of resources. They were also asked to report whether they and their parents were born in the country in which they sat the test, but not which country this was. Students who were born outside the country of the test were asked to report the age at which they arrived in the country. TIMSS grade-8 students were additionally asked to report how often they spoke the language of the test at home and the level of education obtained by their mother (or female guardian) and father (or male guardian). Since the first edition in 2000, students participating in PISA were asked to report information on their parents’ educational attainment, the possession of resources (a considerably
larger list compared to TIMSS and PIRLS, including educational and cultural resources and home durables), and occupational status (through open-ended questions, which are difficult and expensive to code) to construct an aggregate indicator of economic, social, and cultural status (ESCS). The aim of the index is to provide an absolute measure of SES that is comparable within each country across different population groups, while also being comparable over time and across countries. A number of studies have questioned the extent to which the PISA ESCS index can successfully achieve this (Rutkowski & Rutkowski, 2013). Students were also asked to report information on the language they spoke most often at home, if different from the language of instruction. In PISA 2018, students were asked to report the extent to which they spoke different languages with different individuals (mother, father, brother/sister, best friend, schoolmates). Furthermore, they were asked to report their country of birth and the country of birth of their parents, in case they or their parents were not born in the country in which they sat the test. Since 2015, PISA has asked students to report whether they provided unpaid work within the household or engaged in paid work. Similarly, since 1999, grade-8 students participating in TIMSS were asked to report the number of hours they worked for pay. In 2018, PISA administered, as part of an ad hoc optional well-being questionnaire, questions aimed at identifying whether students worried about family finances.
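As a rough illustration of how separate components can be combined into an aggregate indicator of this kind, the Python sketch below standardizes three hypothetical components (years of parental education, parental occupational status, and a home-possessions score) and takes their first principal component as a composite index. The data, the choice of components, and the weighting are illustrative assumptions only and do not reproduce the operational OECD procedure for ESCS.

import numpy as np

# Illustrative component values for five students (not real PISA data):
# [years of parental education, parental occupational status, home-possessions score]
components = np.array([
    [16, 65,  1.2],
    [11, 30, -0.4],
    [14, 52,  0.3],
    [ 9, 25, -1.1],
    [18, 78,  1.8],
], dtype=float)

# Standardize each component so that all three are on a common scale.
z = (components - components.mean(axis=0)) / components.std(axis=0)

# Use the first principal component of the standardized variables as the
# composite index; flip its sign, if needed, so that higher values mean higher SES.
eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
weights = eigvecs[:, -1]            # loading vector of the largest eigenvalue
if weights.sum() < 0:
    weights = -weights
ses_index = z @ weights

print(np.round(ses_index, 2))       # one composite value per student

Because such an index depends on which components are included and how they are weighted, two equally reasonable operationalizations can rank the same students somewhat differently, which is one reason why estimates of socio-economic achievement gaps vary across studies.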

Physical Dimension

The physical dimension of well-being is the dimension least covered in the context of ILSAs. Engaging in moderate and vigorous physical activity is beneficial for people’s general health (Bouchard, Blair, & Haskell, 2012). According to specialists, children should engage in at least 60 minutes of moderate to vigorous physical activity per day (Strong et al., 2005), and in vigorous physical activity on at least 3 days per week to strengthen their muscles and bones (Janssen & LeBlanc, 2010). Students should engage in moderate and vigorous physical activity through physical education (P.E.) classes at school and through sports activities practiced outside of school. Just as schools’ academic curriculum promotes students’ academic skills and their ability and willingness to engage with problems requiring numeracy in the future, physical education aims at developing and promoting students’ physical competencies, healthy lifestyles, and their ability to transfer such skills and knowledge to perform a range of physical activities (Bailey, 2006). Healthy habits developed during childhood often carry through to adulthood (Bailey, 2006). Table 5 indicates that in PISA 2015 and 2018, students were asked to report the frequency of physical activity in or outside of school as well as the extent to which physical education was part of the curriculum. Students were asked to report the number of days on which they engaged in moderate and vigorous physical activities as well as whether they practiced exercise or sport either before or after going to school. In 2018, a well-being questionnaire was administered in a small set of countries.

Table 5 Indicators of the physical well-being dimension in ILSAs

Time spent playing sports: PISA, 2015 and 2018 (moderate and vigorous physical activity; physical education classes); TIMSS, 1999 and following (both grades)
Health conditions: PISA, 2018 (well-being questionnaire)
Self-reported health: PISA, 2018 (well-being questionnaire)
(No indicators of this dimension are included in PIRLS.)

Students were asked to report their overall health (using four response categories ranging from excellent to poor), to report their height and weight (as a way to identify body mass index), to express their body image, and to report the frequency with which they experienced a number of physical and mental health conditions, including headache, stomach pain, back pain, feeling depressed, feeling irritable or having a bad temper, feeling nervous, having difficulties falling asleep, feeling dizzy, or feeling anxious. Grade-4 and grade-8 students participating in TIMSS were asked to report the amount of time spent playing sports, as an indicator of a healthy lifestyle, before or after school during a school day (see Table 5). In 1999, students could indicate that they spent no time,

Search 2: Migration Only (5,242 results)
(noft(pirls OR pisa OR timss OR programme for international student assessment OR progress in international reading literacy study OR trends international mathematics AND science study) AND noft(migration OR ethnicity OR ethnic background OR ethnic OR native OR non-native OR minority) AND noft(achievement OR career aspiration OR motivation)) AND stype.exact(“Scholarly Journals”) AND at.exact(“Article”) AND la.exact(“English”) AND PEER(yes) AND pd(>20000101)

Search 3: SES Only (5,848 results)
(noft(socioeconomic status OR socio-economic status OR social class OR social status OR income OR disadvantaged OR poverty OR socioeconomic background OR socio-economic background OR social background OR social inequality OR socioeconomic inequality OR socio-economic inequality) AND noft(pirls OR pisa OR timss OR programme for international student assessment OR progress in international reading literacy study OR trends international mathematics AND science study) AND noft(achievement OR career aspiration OR motivation)) AND stype.exact(“Scholarly Journals”) AND at.exact(“Article”) AND la.exact(“English”) AND PEER(yes) AND pd(>20000101)

Appendix B Articles in the Final Round of Literature Review, Considered in the Chapter

Articles discussed in this chapter which were identified through the literature search are listed in this appendix.

Alivernini, F., & Manganelli, S. (2015). Country, school and students factors associated with extreme levels of science literacy across 25 countries. International Journal of Science Education, 37(12), 1992–2012. https://doi.org/10.1080/09500693.2015.1060648 Acosta, S. T., & Hsu, H. Y. (2014). Negotiating diversity: An empirical investigation into family, school and student factors influencing New Zealand adolescents’ science literacy. Educational Studies, 40(1), 98–115.


Ammermueller, A. (2007). Poor background or low returns? Why immigrant students in Germany perform so poorly in the Programme for International Student Assessment. Education Economics, 15(2), 215–230. Andersen, I. G., & Jæger, M. M. (2015). Cultural capital in context: Heterogeneous returns to cultural capital across schooling environments. Social Science Research, 50, 177–188. https://doi.org/10.1016/j.ssresearch.2014.11.015 Arikan, S., Vijver, F., & Yagmur, J. (2017). PISA mathematics and reading performance differences of mainstream European and Turkish immigrant students. Educational Assessment, Evaluation and Accountability, 29(3), 229–246. Azzolini, D., Schnell, P., Palmer, J., Adserà, A., & Tienda, M. (2012). Educational achievement gaps between immigrant and native students in two “new” immigration countries: Italy and Spain in comparison. The ANNALS of the American Academy of Political and Social Science, 643(1), 46–77. Baker, D. P., Goesling, B., & Letendre, G. K. (2002). Socioeconomic status, school quality and national economic development: A cross-national analysis of the Heyneman-Loxley effect on mathematics and science achievement. Comparative Education Review, 46(3), 291–312. Benito, R., Alegre, M. Á., & González-Balletbó, I. (2014). School segregation and its effects on educational equality and efficiency in 16 OECD comprehensive school systems. Comparative Education Review, 58(1), 104–134. Bol, T., Witschge, J., Van de Werfhorst, H. G., & Dronkers, J. (2014). Curricular tracking and central examinations: Counterbalancing the impact of social background on Borgna, C., & Contini, D. (2014). Migrant achievement penalties in Western Europe: Do educational systems matter? European Sociological Review, 30(5), 670–683. https://doi.org/10.1093/esr/jcu067 Brunello, G., & Rocco, L. (2013). The effect of immigration on the school performance of natives: Cross country evidence using PISA test scores. Economics of Education Review, 32(C), 234–246. Burger, K., & Walk, M. (2016). Can children break the cycle of disadvantage? Structure and agency in the transmission of education across generations. Social Psychology of Education, 19, 695–713. https://doi.org/10.1007/s11218-016-9361-y Byun, S. Y., Schofer, E., & Kim, K. K. (2012). Revisiting the role of cultural capital in East Asian educational systems: The case of South Korea. Sociology of Education, 85(3), 219–239. Caro, D. H., Sandoval-Hernández, A., & Lüdtke, O. (2014). Cultural, social, and economic capital constructs in international assessments: An evaluation using exploratory structural equation modeling. School Effectiveness and School Improvement, 25(3), 433–450. https://doi.org/10.1080/09243453.2013.812568 Cheema, J. (2014). The migrant effect: An evaluation of native academic performance in Qatar. Research in Education, 91, 65–77. Chiu, M. M. (2015). Family inequality, school inequalities, and mathematics achievement in 65 countries: Microeconomic mechanisms of rent seeking and diminishing marginal returns. Teachers College Record, 117(1), 1–32.


Chiu, M. M., & Chow, B. W. Y. (2015). Classmate characteristics and student achievement in 33 countries: Classmates’ past achievement, family socioeconomic status, educational resources, and attitudes toward reading. Journal of Educational Psychology, 107(1), 152. Chiu, M., & McBride-Chang, C. (2010). Family and reading in 41 countries: Differences across cultures and students. Scientific Studies of Reading, 14(6), 514–543. Chudgar, A., & Luschei, T. F. (2009). National income, income inequality, and the importance of schools: A hierarchical cross-national comparison. American Educational Research Journal, 46(3), 626–658. Chudgar, A., Luschei, T. F., & Zhou, Y. (2013). Science and mathematics achievement and the importance of classroom composition: Multicountry analysis using TIMSS 2007. American Journal of Education, 119(2), 295–316. Dockery, A., Koshy, P., & Li, I. (2019). Culture, migration and educational performance: A focus on gender outcomes using Australian PISA tests. Australian Educational Researcher, 47, 1–21. https://doi.org/10.1007/s13384-019-00321-7 Coll, R. K., Dahsah, C., & Faikhamta, C. (2010). The influence of educational context on science learning: A cross-national analysis of PISA. Research in Science & Technological Education, 28(1), 3–24. Dronkers, J., & Kornder, N. (2014). Do migrant girls perform better than migrant boys? Deviant gender differences between the reading scores of 15-year-old children of migrants compared to native pupils. Educational Research and Evaluation, 20(1), 44–66. Dronkers, J., Van Der Velden, R., & Dunne, A. (2012). Why are migrant students better off in certain types of educational systems or schools than in others? European Educational Research Journal, 11(1), 11–44. Edgerton, J. D., Roberts, L. W., & Peter, T. (2013). Disparities in academic achievement: Assessing the role of habitus and practice. Social Indicators Research, 114(2), 303–322. Entorf, H., & Lauk, M. (2008). Peer effects, social multipliers and migrants at school: An international comparison. Journal of Ethnic and Migration Studies, 34(4), 633–654. Gramațki, I. (2017). A comparison of financial literacy between native and immigrant school students. Education Economics, 25(3), 304–322. https://doi.org/10.1080/09645292.2016.1266301 Guo, J., Marsh, H. W., Parker, P. D., Morin, A. J., & Yeung, A. S. (2015). Expectancy-value in mathematics, gender and socioeconomic background as predictors of achievement and aspirations: A multi-cohort study. Learning and Individual Differences, 37, 161–168. Gustafsson, J.-E., Nielsen, T., & Yang Hansen, K. (2018). School characteristics moderating the relation between student socioeconomic status and mathematics achievement in grade 8. Evidence from 50 countries in TIMSS 2011. Studies in Educational Evaluation, 57(Special Issue), 16–30. https://doi.org/10.1016/j.stueduc.2016.09.004


Hána, D., Hasman, J., & Kostelecká, Y. (2017). The educational performance of immigrant children at Czech schools. Oxford Review of Education, 43(1), 38–54. https://doi.org/10.1080/03054985.2016.1235030 Hanushek, E. A., & Woessmann, L. (2006). Does early tracking affect educational inequality and performance? Differences-in-differences evidence across countries. Economic Journal, 116(510), C63–C76. Howie, S., Scherman, V., & Venter, E. (2008). The gap between advantaged and disadvantaged students in science achievement in South African secondary schools. Educational Research and Evaluation, 14(1), 29–46. Huang, H. (2015). Can students themselves narrow the socioeconomic-status-based achievement gap through their own persistence and learning time? Education Policy Analysis Archives, 23(108). https://doi.org/10.14507/epaa.v23.1977 Huang, H., & Sebastian, J. (2015). The role of schools in bridging within-school achievement gaps based on socioeconomic status: A cross-national comparative study. Compare: A Journal of Comparative and International Education, 45(4), 501–525. https://doi.org/10.1080/03057925.2014.905103 Humlum, M. K. (2011). Timing of family income, borrowing constraints, and child achievement. Journal of Population Economics, 24(3), 979–1004. Hvistendahl, R., & Roe, A. (2004). The literacy achievement of Norwegian minority students. Scandinavian Journal of Educational Research, 48(3), 307–324. Ishida, K., Nakamuro, M., & Takenaka, A. (2016). The academic achievement of immigrant children in Japan: An empirical analysis of the assimilation hypothesis. Educational Studies in Japan: International Yearbook, 10, 93–107. Jacobs, B., & Wolbers, M. H. J. (2018). Inequality in top performance: An examination of cross-country variation in excellence gaps across different levels of parental socioeconomic status. Educational Research and Evaluation, 24(1–2), 68–87. https://doi.org/10.1080/13803611.2018.1520130 Jerrim, J. (2015). Why do East Asian children perform so well in PISA? An investigation of Western-born children of East Asian descent. Oxford Review of Education, 41(3), 310–333. Kalaycioglu, D. B. (2015). The influence of socioeconomic status, self-efficacy, and anxiety on mathematics achievement in England, Greece, Hong Kong, the Netherlands, Turkey, and the USA. Educational Sciences: Theory and Practice, 15(5), 1391–1401. Karakolidis, A., Pitsia, V., & Emvalotis, A. (2016). Mathematics low achievement in Greece: A multilevel analysis of the Programme for International Student Assessment (PISA) 2012 data. Themes in Science and Technology Education, 9(1), 3–24. Kim, D. H., & Law, H. (2012). Gender gap in maths test scores in South Korea and Hong Kong: Role of family background and single-sex schooling. International Journal of Educational Development, 32(1), 92–103. Kryst, E. L., Kotok, S., & Bodovski, K. (2015). Rural/urban disparities in science achievement in post-socialist countries: The evolving influence of socioeconomic status. Global Education Review, 2(4).


Lam, B. O. Y., Byun, S. Y., & Lee, M. (2019). Understanding educational inequality in Hong Kong: Secondary school segregation in changing institutional contexts. British Journal of sociology of Education, 40(8), 1170–1187. Lam, T. Y. P., & Lau, K. C. (2014). Examining factors affecting science achievement of Hong Kong in PISA 2006 using hierarchical linear modeling. International Journal of Science Education, 36(15), 2463–2480. Lavrijsen, J., & Nicaise, I. (2015). New empirical evidence on the effect of educational tracking on social inequalities in reading achievement. European Educational Research Journal, 14(3–4), 206–221. Le Donné, N. (2014). European variations in socioeconomic inequalities in students’ cognitive achievement: The role of educational policies. European Sociological Review, 30(3), 329–343. Lenkeit, J., Schwippert, K., & Knigge, M. (2018). Configurations of multiple disparities in reading performance: Longitudinal observations across France, Germany, Sweden and the United Kingdom. Assessment in Education: Principles, Policy & Practice, 25(1), 52–86. https://doi.org/10.1080/0969594X.2017. 1309352 Levels, M., & Dronkers, J. (2008). Educational performance of native and immigrant children from various countries of origin. Ethnic and Racial Studies, 31(8), 1404–1425. https://doi.org/10.1080/01419870701682238 Levels, M., Dronkers, J., & Kraaykamp, G. (2008). Immigrant children’s educational achievement in Western countries: Origin, destination, and community effects on mathematical performance. American Sociological Review, 73(5), 835–853. Liu, H., Van Damme, J., Gielen, S., & Van Den Noortgate, W. (2015). School processes mediate school compositional effects: Model specification and estimation. British Educational Research Journal, 41(3), 423–447. Liu, Y., Wu, A. D., & Zumbo, B. D. (2006). The relation between outside of school factors and mathematics achievement: A cross-country study among the US and five top-performing Asian countries. Journal of Educational Research & Policy Studies, 6(1), 1–35. Luschei, T. F., & Chudgar, A. (2011). Teachers, student achievement and national income: A cross-national examination of relationships and interactions. Prospects, 41(4), 507–533. Ma, X. (2001). Stability of socio-economic gaps in mathematics and science achievement among Canadian schools. Canadian Journal of Education/Revue canadienne de l’education, 97–118. Marks, G. N., Cresswell, J., & Ainley, J. (2006). Explaining socioeconomic inequalities in student achievement: The role of home and school factors. Educational Research and Evaluation, 12(2), 105–128. https://doi.org/10.1080/ 13803610600587040 Martin, A., Liem, G., Mok, M., & Xu, J. (2012). Problem solving and immigrant student mathematics and science achievement: Multination findings from the Programme for International Student Assessment (PISA). Journal of Educational Psychology, 104(4), 1054–1073.


Matějů, P., & Straková, J. (2005). The role of the family and the school in the reproduction of educational inequalities in the post-Communist Czech Republic. British Journal of Sociology of Education, 26(1), 17–40. Mere, K., Reiska, P., & Smith, T. M. (2006). Impact of SES on Estonian students’ science achievement across different cognitive domains. Prospects, 36(4), 497–516. Mohammadpour, E., & Abdul Ghafar, M. N. (2014). Mathematics achievement as a function of within- and between-school differences. Scandinavian Journal of Educational Research, 58(2), 189–221. Myrberg, E., & Rosén, M. (2006). Reading achievement and social selection in independent schools in Sweden: Results from IEA PIRLS 2001. Scandinavian Journal of Educational Research, 50(2), 185–205. Nehring, A., Nowak, K. H., zu Belzen, A. U., & Tiemann, R. (2015). Predicting students’ skills in the context of scientific inquiry with cognitive, motivational, and sociodemographic variables. International Journal of Science Education, 37(9), 1343–1363. Netten, A., Voeten, M., Droop, M., & Verhoeven, L. (2014). Sociocultural and educational factors for reading literacy decline in the Netherlands in the past decade. Learning and Individual Differences, 32, 9–18. Nieto, S., & Ramos, R. (2015). Educational outcomes and socioeconomic status: A decomposition analysis for middle-income countries. Prospects, 45, 325–343. https://doi.org/10.1007/s11125-015-9357-y Nonoyama-Tarumi, Y. (2008). Cross-national estimates of the effects of family background on student achievement: A sensitivity analysis. International Review of Education, 54(1), 57–82. Park, H. (2008). The varied educational effects of parent-child communication: A comparative study of fourteen countries. Comparative Education Review, 52(2), 219–243. Pásztor, A. (2008). The children of guest workers: Comparative analysis of scholastic achievement of pupils of Turkish origin throughout Europe. Intercultural Education, 19(5), 407–419. https://doi.org/10.1080/14675980802531598 Piel, S., & Schuchart, C. (2014). Social origin and success in answering mathematical word problems: The role of everyday knowledge. International Journal of Educational Research, 66, 22–34. Pivovarova, M., & Powers, J. (2019). Generational status, immigrant concentration and academic achievement: Comparing first and second-generation immigrants with third-plus generation students. Large-Scale Assessments in Education, 7(1), 1–18. Pokropek, A., Borgonovi, F., & Jakubowski, M. (2015). Socio-economic disparities in academic achievement: A comparative analysis of mechanisms and pathways. Learning and Individual Differences, 42, 10–18. https://doi.org/10.1016/j.lindif.2015.07.011 Powers, J., & Pivovarova, M. (2017). Analysing the achievement and isolation of immigrant and U.S.-born students: Insights from PISA 2012. Educational Policy, 31(6), 830–857.


Rajchert, J. M., Żółtak, T., & Smulczyk, M. (2014). Predicting reading literacy and its improvement in the Polish national extension of the PISA study: The role of intelligence, trait- and state-anxiety, socioeconomic status and school-type. Learning and Individual Differences, 33, 1–11. https://doi.org/10.1016/j.lindif. 2014.04.003 Robert, P. (2010). Social origin, school choice, and student performance. Educational Research and Evaluation, 16(2), 107–129. Rutkowski, D., Rutkowski, L., Wild, J., & Burroughs, N. (2018). Poverty and educational achievement in the US: A less-biased estimate using PISA 2012 data. Journal of Children and Poverty, 24(1), 47–67. https://doi.org/10.1080/ 10796126.2017.1401898 Matsuoka, R. (2014). Disparities between schools in Japanese compulsory education: Analyses of a cohort using TIMSS 2007 and 2011. Educational Studies in Japan: International Yearbook, 8, 77–92. Schleicher, A. (2009). Securing quality and equity in education: Lessons from PISA. Prospects, 39(3), 251–263. Schmidt, W. H., Burroughs, N. A., Zoido, P., & Houang, R. T. (2015). The role of schooling in perpetuating educational inequality: An international perspective. Educational Researcher, 44(7), 371–386. Schnepf, S. V. (2007). Immigrants’ educational disadvantage: An examination across ten countries and three surveys. (Author abstract). Journal of Population Economics, 20(3), 527–545. Shapira, M. (2012). An exploration of differences in mathematics attainment among immigrant pupils in 18 OECD countries. European Educational Research Journal, 11(1), 68–95. Shera, P. (2014). School effects, gender and socioeconomic differences in reading performance: A multilevel analysis. International Education Studies, 7(11), 28–39. Shin, S. H., Slater, C. L., & Backhoff, E. (2013). Principal perceptions and student achievement in reading in Korea, Mexico, and the United States: Educational leadership, school autonomy, and use of test results. Educational Administration Quarterly, 49(3), 489–527. Straus, M. (2014a). (In)equalities in PISA 2012 mathematics achievement, socioeconomic gradient and mathematics-related attitudes of students in Slovenia, Canada, Germany and the United States. Šolsko Polje, 25(5/6), 121–143, 157–159. Spörlein, C., & Schlueter, E. (2018). How education systems shape cross-national ethnic inequality in math competence scores: Moving beyond mean differences. PLoS One, 13(3), e0193738. https://doi.org/10.1371/journal.pone.0193738 Straková, J. (2007). The impact of the structure of the education system on the development of educational inequalities in the Czech Republic. Sociologický časopis/Czech Sociological Review, 43(3), 589–610. Straus, M. (2014b). (In)equalities in PISA 2012 mathematics achievement, socioeconomic gradient and mathematics-related attitudes of students in Slovenia, Canada, Germany and the United States. Šolsko Polje, 25(5/6), 121–143, 157–159.


Sun, L., Bradley, K. D., & Akers, K. (2012). A multilevel modelling approach to investigating factors impacting science achievement for secondary school students: PISA Hong Kong sample. International Journal of Science Education, 34(14), 2107–2125. https://doi.org/10.1080/09500693.2012.708063 Tan, C. Y. (2015). The contribution of cultural capital to students’ mathematics achievement in medium and high socioeconomic gradient economies. British Educational Research Journal, 41(6), 1050–1067. Tan, C. Y. (2017). Do parental attitudes toward and expectations for their children’s education and future jobs matter for their children’s school achievement? British Educational Research Journal, 43(6), 1111–1130. Televantou, I., Marsh, H. W., Kyriakides, L., Nagengast, B., Fletcher, J., & Malmberg, L.-E. (2015). Phantom effects in school composition research: Consequences of failure to control biases due to measurement error in traditional multilevel models. School Effectiveness and School Improvement, 26(1), 75–101. https://doi.org/10.1080/09243453.2013.871302 Teltemann, J., & Schunck, R. (2016). Education systems, school segregation, and second-generation immigrants’ educational success: Evidence from a country-fixed effects approach using three waves of PISA. International Journal of Comparative Sociology, 57(6), 401–424. https://doi.org/10.1177/0020715216687348 Tramonte, L., & Willms, J. D. (2010). Cultural capital and its effects on education outcomes. Economics of Education Review, 29(2), 200–213. Tucker-Drob, E. M., Cheung, A. K., & Briley, D. A. (2014). Gross domestic product, science interest, and science achievement: A person nation interaction. Psychological Science, 25(11), 2047–2057. van Hek, M., Kraaykamp, G., & Pelzer, B. (2018). Do schools affect girls’ and boys’ reading performance differently? A multilevel study on the gendered effects of school resources and school practices. School Effectiveness and School Improvement, 29(1), 1–21. https://doi.org/10.1080/09243453.2017.1382540 Veerman, G., & Dronkers, J. (2016). Ethnic composition and school performance in the secondary education of Turkish migrant students in seven countries and 19 European educational systems. International Migration Review, 50(3), 537–567. Weiss, C. C., & García, E. (2015). Student engagement and academic performance in Mexico: Evidence and puzzles from PISA. Comparative Education Review, 59(2), 305–331. Williams, J. H. (2005). Cross-national variations in rural mathematics achievement: A descriptive overview. Journal of Research in Rural Education, 20, 1–18. Willms, J. D. (2003). Literacy proficiency of youth: Evidence of converging socioeconomic gradients. International Journal of Educational Research, 39(3), 247–252. Willms, J. D. (2010). School composition and contextual effects on student outcomes. Teachers College Record, 112(4), 1008–1037. Wiseman, A. W. (2012). The impact of student poverty on science teaching and learning: A cross-national comparison of the South African case. American Behavioral Scientist, 56(7), 941–960.


Xu, J., & Hampden-Thompson, G. (2012). Cultural reproduction, cultural mobility, cultural resources, or trivial effect? A comparative approach to cultural capital and educational performance. Comparative Education Review, 56(1), 98–124. Yang, Y. (2003). Dimensions of socio-economic status and their relationship to mathematics and science achievement at individual and collective levels. Scandinavian Journal of Educational Research, 47(1), 21–41. https://doi.org/10.1080/ 00313830308609 Yang, Y., & Gustafsson, J. E. (2004). Measuring socioeconomic status at individual and collective levels. Educational Research and Evaluation, 10(3), 259–288. Yang Hansen, K., Rosén, M., & Gustafsson, J. E. (2011). Changes in the multilevel effects of socio-economic status on reading achievement in Sweden in 1991 and 2001. Scandinavian Journal of Educational Research, 55(2), 197–211. https:// doi.org/10.1080/00313831.2011.554700 Yang Hansen, K., & Munck, I. (2012). Exploring the measurement profiles of socioeconomic background indicators and their differences in reading achievement: A two-level latent class analysis. The IERI Monograph Series, 5, 67–95. Zhang, L. C., & Sheu, T. M. (2013). Effective investment strategies on mathematics performance in rural areas. Quality & Quantity, 47(5), 2999–3017. Zhou, Y., Wong, Y., & Li, W. (2015). Educational choice and marketisation in Hong Kong: The case of direct subsidy scheme schools. Asia Pacific Educational Review, 16, 627–636. https://doi.org/10.1007/s12564-015-9402-9 Zuzovsky, R. (2008). Capturing the dynamics behind the narrowing achievement gap between Hebrew-speaking and Arabic-speaking schools in Israel: Findings from TIMSS 1999 and 2003. Educational Research and Evaluation, 14(1), 47–71. https://doi.org/10.1080/13803610801896562

Appendix C Indicators of Family Socioeconomic Status in PISA, TIMSS, and PIRLS in each Cycle

Item coverage in PISA background questionnaires (Adams & Wu, 2003; OECD, 2005, 2009, 2012, 2014, 2017). Cycles covered: 2000, 2003, 2006, 2009, 2012, 2015.

Home possession and study resource items asked of students: a desk to study at; a quiet place to study; a computer you can use for school work; books to help with your school work; a dictionary; educational software; technical reference books; calculator. Of these eight items, five were included in 2000, six in 2003 and in 2006, and seven in each of 2009, 2012, and 2015.

Included in all six cycles: a room of your own; a link to the Internet; classic literature (e.g., Shakespeare); books of poetry; works of art (e.g., paintings); how many books are there in your home?; parental education; parental occupation.
Included in five of the six cycles: number of televisions; number of cars; cell phones with Internet access (e.g., smartphones); computers (desktop computer, portable laptop, or notebook); a dishwasher.
Included in four of the six cycles: rooms with a bath or shower.
Included in a single cycle: books on art, music, or design; tablet computers (e.g., iPad®, BlackBerry® PlayBook™); e-book readers (e.g., Kindle™, Kobo, Bookeen); musical instruments (e.g., guitar and piano).

NB. From 2003 onwards, participating countries have been able to specify an additional 3 country-specific items indicating wealth.

Adams, R., & Wu, M. (Eds.). (2003). Programme for International Student Assessment (PISA): PISA 2000 technical report. OECD Publishing.
OECD. (2005). PISA 2003 technical report. OECD Publishing.
OECD. (2009). PISA 2006 technical report. OECD Publishing.
OECD. (2012). PISA 2009 technical report. OECD Publishing.
OECD. (2014). PISA 2012 technical report. OECD Publishing.
OECD. (2017). PISA 2015 technical report. OECD Publishing. http://www.oecd.org/pisa/sitedocument/PISA-2015-technical-report-final.pdf

Item coverage in TIMSS background questionnaires for grade 8 (Foy, 2017; Foy et al., 2013; Foy & Olson, 2009; Gonzalez & Miles, 2001; Gonzalez & Smith, 1997; Martin, 2005). Cycles covered: 1995, 1999, 2003, 2007, 2011, 2015.

Items asked, with coverage varying across the cycles: how many books are in your home?; calculator; computer; a computer or tablet of your own; a computer or tablet that is shared with other people at home; study desk; dictionary; Internet connection (included in three cycles); books of your very own (do not count your school books) (one cycle); your own room (two cycles); your own mobile phone (one cycle); a gaming system (e.g., PlayStation®, Wii®, XBox®) (one cycle).

NB. Participating countries have been able to specify additional country-specific items indicating wealth.

Foy, P. (Ed.). (2017). TIMSS 2015 user guide for the international database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Foy, P., Arora, A., & Stanco, G. M. (Eds.). (2013). TIMSS 2011 user guide for the international database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Foy, P., & Olson, J. F. (Eds.). (2009). TIMSS 2007 user guide for the international database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Gonzalez, E. J., & Miles, J. A. (Eds.). (2001). TIMSS 1999 user guide for the international database. TIMSS Study Center, Boston College.
Gonzalez, E. J., & Smith, T. A. (Eds.). (1997). User guide for the TIMSS international database – Primary and middle school years, 1995 assessment. TIMSS International Study Center, Boston College.
Martin, M. O. (Ed.). (2005). TIMSS 2003 user guide for the international database. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.

Item coverage in PIRLS questionnaires (Foy & Drucker, 2013; Foy & Kennedy, 2008; Gonzalez & Kennedy, 2003; Mullis & Martin, 2015). Cycles covered: 2001, 2006, 2011, 2016.

Included in all four cycles: how many books are in your home?; computer or tablet; study desk.
Included in three cycles: books of your very own (do not count your school books); your own room.
Included in two cycles: Internet connection; daily newspaper.
Included in one cycle: your own mobile phone.

NB. Participating countries have been able to specify additional country-specific items indicating wealth.

Foy, P., & Drucker, K. T. (Eds.). (2013). PIRLS 2011 user guide for the international database: Supplement 1. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Foy, P., & Kennedy, A. M. (Eds.). (2008). PIRLS 2006 user guide for the international database: Supplement 1. TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
Gonzalez, E. J., & Kennedy, A. M. (Eds.). (2003). PIRLS 2001 user guide for the international database: Supplement 1. International Study Center, Boston College.
Mullis, I. V. S., & Martin, M. O. (Eds.). (2015). PIRLS 2016 assessment framework (2nd ed.).
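To make concrete how binary possession items of the kind listed in the coverage tables above can be condensed into a single household-resources indicator, the Python sketch below computes a simple proportion-of-items-present score. The item labels and responses are invented for illustration; operational indices, such as PISA's home-possessions scale, are instead derived with IRT scaling of the full item set.

# Hypothetical possession responses (1 = item present at home, 0 = absent);
# the items echo those in the coverage tables above, but the data are invented.
ITEMS = ["own_room", "internet_link", "dishwasher", "classic_literature",
         "poetry_books", "works_of_art", "study_desk", "dictionary"]

def possessions_score(responses):
    """Proportion of the listed items reported as present (0.0 to 1.0);
    a very crude stand-in for IRT-scaled home-possessions indices."""
    answered = [responses[item] for item in ITEMS if item in responses]
    return sum(answered) / len(answered) if answered else float("nan")

student = {"own_room": 1, "internet_link": 1, "dishwasher": 0,
           "classic_literature": 0, "poetry_books": 1, "works_of_art": 0,
           "study_desk": 1, "dictionary": 1}
print(possessions_score(student))  # 0.625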


References

Avvisati, F. (2020). The measure of socioeconomic status in PISA: A review and some suggested improvements. Large-Scale Assessments in Education, 8, 8. https://doi.org/10.1186/s40536-020-00086-x Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to meta-analysis. Wiley. Bourdieu, P. (1977). Cultural reproduction and social reproduction. In J. Karabel & A. H. Halsey (Eds.), Power and ideology in education. Oxford University Press. Bourdieu, P. (1984). Distinction. A social critique of the judgement of taste. Harvard University Press. Bourdieu, P., & Passeron, J. C. (1977). Reproduction in education, society and culture. Beverly Hills, CA: Sage Publications. Bourdieu, P., & Passeron, J.-C. (1990). Reproduction in education, society and culture. SAGE. Broer, M., Bai, Y., & Fonseca, F. (2019). Socioeconomic inequality and educational outcomes: Evidence from twenty years of TIMSS. Springer Nature. Buchmann, C. (2002). Measuring family background in international studies of education: Conceptual issues and methodological challenges. In National Research Council (Ed.), Methodological advances in cross-national surveys of educational achievement (pp. 150–197). The National Academies Press. https://doi.org/10.17226/10322 Coleman, J. S., Campbell, E. Q., Hobson, C. J., McPartland, J., Mood, A. M., Weinfeld, F. D., & York, R. (1966). Equality of educational opportunity. US Congressional Printing Office. Cowan, C. D., Hauser, R. M., Levin, H. M., Beale Spencer, M., & Chapman, C. (2012). Improving the measurement of socioeconomic status for the National Assessment of Educational Progress: A theoretical foundation. Retrieved from https://nces.ed.gov/nationsreportcard/pdf/researchcenter/Socioeconomic_Factors.pdf Duncan, O. D. (1961). A socioeconomic index for all occupations. In A. J. Reiss Jr. (Ed.), Occupations and social status (pp. 139–161). Free Press. Duncan, O. D., Featherman, D. L., & Duncan, B. (1972). Socioeconomic background and achievement. Seminar Press. Gipps, C., & Stobart, G. (2010). Fairness. In P. Peterson, E. Baker, & B. McGaw (Eds.), International encyclopedia of education (3rd ed., pp. 56–60). Oxford. Gottfried, A. (1985). Measures of socioeconomic status in child development research: Data and recommendations. Merrill-Palmer Quarterly, 31(1), 85–92. Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge. Hauser, R. M. (1994). Measuring socioeconomic status in studies of child development. Child Development, 65(6), 1541–1545. Hvistendahl, R., & Roe, A. (2004). The literacy achievement of Norwegian minority students. Scandinavian Journal of Educational Research, 48(3), 307–324. Marks, G. N. (2013). Education, social background and cognitive ability: The decline of the social. Routledge. Matsuoka, R. (2014). Disparities between schools in Japanese compulsory education: Analyses of a cohort using TIMSS 2007 and 2011. Educational Studies in Japan, 8, 77–92. https://doi.org/10.7571/esjkyoiku.8.77 Mueller, C. W., & Parcel, T. L. (1981). Measures of socioeconomic status: Alternatives and recommendations. Child Development, 52(1), 12–30. OECD. (2009). PISA 2006 technical report. OECD Publishing. OECD. (2017). PISA 2015 technical report. OECD Publishing. OECD. (2019). International migration outlook 2019. OECD Publishing. Rutkowski, D., & Rutkowski, L. (2013). Measuring socioeconomic background in PISA: One size might not fit all.
Research in Comparative and International Education, 8(3), 259–278.


Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142–151. https://doi.org/10.3102/0013189X10363170 Sirin, S. R. (2005). Socioeconomic status and academic achievement: A meta-analytic review of research. Review of Educational Research, 75(3), 417–453. Sullivan, A. (2001). Cultural capital and educational attainment. Sociology, 35(4), 893–912. United Nations Department of Economic and Social Affairs. (2020). World social report 2020 – Inequality in a rapidly changing world. Available at https://www.un.org/development/desa/ dspd/wp-content/uploads/sites/22/2020/02/World-Social-Report2020-FullReport.pdf Von Davier, M., Gonzalez, E., & Mislevy, R. (2009). What are plausible values and why are they useful. IERI Monograph Series, 2(1), 9–36. White, K. R. (1982). The relation between socioeconomic status and academic achievement. Psychological Bulletin, 91(3), 461–481. Willms, J., & Raudenbush, S. (1989). A longitudinal hierarchical linear model for estimating school effects and their stability. Journal of Educational Measurement, 26(3), 209–232. Retrieved from http://www.jstor.org/stable/1434988

Part XIII Concluding Remarks

52 60-Years of ILSA: Where It Stands and How It Evolves

Agnes Stancel-Piątak (IEA Hamburg, Hamburg, Germany), Trude Nilsen (Department of Teacher Education and School Research, Faculty of Educational Sciences, University of Oslo, Oslo, Norway), and Jan-Eric Gustafsson (University of Gothenburg, Gothenburg, Sweden)

Contents
Meta-Perspectives on ILSAs in Education
Theoretical Meta-Perspectives: Educational Accountability and the Role of International Assessments
Theoretical Frameworks and Assessed Domains in ILSA
Populations and Design in ILSAs
Methods in ILSA
Generalizability and Comparability of Results
Analytical Potential of ILSA Data for Causal Analysis
Trend Analysis with ILSA Data
Log-Data
Findings
Final Remarks
References


Abstract

The handbook provides an extensive and comprehensive overview of theories, methods, and results from ILSA, presented by excellent authors from the field of educational research and beyond. The discussion chapter aims at adding to this by providing a synthesis of the discussed topics as well as our perspectives on


directions in which possible future developments could move. We hope to contribute by compiling the views and experiences of the chapters’ authors to portray a comprehensive illustration of the scientific landscape. The chapter follows roughly the structure of the handbook, focusing first on theoretical aspects, followed by methodological considerations, and closing with reflections on findings from ILSAs.

Keywords

Discussion · Theory of ILSA · Method of ILSA · Findings of ILSA

It is impossible to provide an overview of all the issues tackled in this handbook in ways that reflect the comprehensiveness and complexity of the scientific discussions by excellent authors from the field of educational research and beyond. Instead, we would like to rely on the diverse views presented to discuss the status of ILSA, adding our perspective on directions in which it could possibly develop in the future. The chapters provide in-depth insights and critical views on specific as well as on more general topics. With this chapter we hope to contribute to this detailed and informative picture by compiling the views and experiences of the chapters’ authors to portray a comprehensive illustration of the scientific landscape. Overall, the handbook has a well-known structure, starting with theoretical approaches, then moving through methods, and ending with a discussion of findings. While this structure hopefully supports the assimilation of the incredible amount of complex information provided in the handbook, it also enables a systematic reflection on each topic area. Thus, this chapter follows roughly the structure of the handbook, focusing first on theoretical aspects, followed by methodological considerations, and closing with reflections on findings from ILSAs.

Meta-Perspectives on ILSAs in Education

Theoretical Meta-Perspectives: Educational Accountability and the Role of International Assessments

Reasons for Participation in ILSA

Reflections on ILSAs that neglect the sociopolitical context would be incomplete and superficial. Without any doubt, ILSAs have become one of the most powerful and widespread tools for monitoring education systems, establishing a position as an integral part of education policy and of society as a whole. The developments accompanying this process cannot be attributed solely to human curiosity and interest in the findings from ILSAs. Large and persistent societal processes have systematically served as trailblazers accelerating the development of ILSAs. Aside from the contributions of globalization and technical inventions to the growth of ILSAs, one of the most important sociopolitical factors was the neoliberal reforms of
the 1980s and 1990s. These reforms resulted, among other things, in significant changes within education systems around the world, which could be briefly described as a responsibility shift from the state to the school level, causing an emerging need for outcome-oriented monitoring. As noted by Liu and Steiner-Khamsi, “...the outcomes orientation of new public management reform triggered a proliferation of standardized student assessment” (▶ Chap. 4, “Reasons for Participation in International Large-Scale Assessments”). While ILSA is already an inherent element of national policies in many education systems, it can be expected that this tendency will increase in the future, accelerated by various mechanisms ensuring sustained interest in ILSAs (Liu and Steiner-Khamsi, ▶ Chap. 4, “Reasons for Participation in International Large-Scale Assessments”).

The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth

From the national perspective, the consequences of the autonomy shift and the responsibility transfer from the state to the school level could be critical, considering that the level of cognitive skills of a population is one of the strongest factors related to the long-run economic growth of a country. Essential thereby is the finding that focusing solely on universal access results in lower economic growth than if universal access is combined with a focus on bringing all students to at least the minimum level (Hanushek and Woessmann, ▶ Chap. 3, “The Political Economy of ILSAs in Education: The Role of Knowledge Capital in Economic Growth”). In the light of these findings, it appears important to consider the extent to which granting schools greater autonomy limits the impact of governments on their national education systems. Some of the governmental power might even shift toward supranational organizations, such as the OECD, leaving content decisions on topics taught, outcome assessment, and interpretation to global organizations, which might or might not be well connected to each country’s national and cultural context. This is because outcome-based assessments, which are becoming one of the most important regulatory instruments, are created by these organizations, allowing them to impose their own standards. In addition to the policy interest that might drive certain developments, ILSAs themselves might cause unintended effects through specific decisions, even at a “technical” level. As pointed out by Klieme, the decision to change the mode from paper-and-pencil to technology-based assessments might, for instance, affect country-level trend analyses (Klieme, ▶ Chap. 19, “Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings”). It is likely that this kind of impact could go beyond the analytical level, such as in the case of technology-based assessment, through its potential to make digitalization an important component of schooling.

Educational Accountability and the Role of ILSAs in Economically Developed and Developing Countries

The regulatory role of ILSAs expands worldwide regardless of the arguments for a careful interpretation of the results, considering the limitations and suitability of ILSAs to serve as a measure of school improvement and outcomes-based accountability
(Loeb, ▶ Chap. 5, “Educational Accountability and the Role of International Large-Scale Assessments”). ILSA opponents argue that the fact that ILSA data are regularly used for such purposes is not so much related to their appropriateness, but is often simply caused by the lack of alternatives. Indeed, the availability of ILSA data corresponds to the increasing need for evidence-based policy making emerging because of the autonomy shift from the governmental level to the school level. This shift is argued to open up space for flexibility in decision-making processes for local principals and the teachers involved. While this might strengthen motivation, enabling more efficient and goal-targeted resource investments, intensive standardized testing might in fact contradict this positive contribution. In this context, opponents perceive the increasing amount of standardized item batteries in multiple-choice format as contrary to the “openness, situational awareness and consideration for complexity” as a habit of mind when aiming at successful improvement (Ehren, ▶ Chap. 6, “International Large-Scale Assessments and Education System Reform”). The risk that the theoretical potential for more freedom related to the autonomy shift might be counteracted by negative effects of frequent standardized testing should not be ignored, especially in those countries where alternative measures of educational effectiveness are not available. Besides some positive effects on education policy (Neuschmidt, Al-Maskari and Beyer, ▶ Chap. 37, “A Non-Western Perspective on ILSAs”), research shows that global testing might also have negative effects, for instance on students’ well-being (Ehren, ▶ Chap. 6, “International Large-Scale Assessments and Education System Reform”). It is reasonable to assume that the negative effects of standardized assessments are more prominent in low- and middle-income countries (LMICs), where education systems and learning contexts can deviate significantly from those of the more developed and wealthier countries (Ahmed et al., ▶ Chap. 7, “The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries”). It seems that the emergence of ILSAs was systematically pursued by global organizations particularly in these countries, as it coincided with the UN call for “Education for All” and was also supported by other global organizations such as the World Bank and UNESCO. In this context, Howie points to commercial and political motives behind a number of the regional LSAs (Howie, ▶ Chap. 17, “Regional Studies in Non-Western Countries, and the Case of SACMEQ”). The rapid developments of regional assessments in LMICs presumably occurred because of this. Supported by global stakeholders, ILSA participation gradually became increasingly important for LMICs. In an attempt to “regain some control within regions and refocus the emphasis of the studies onto more relevant (for regions) issues,” governments increased efforts to consolidate resources to develop regional ILSAs (Howie, ▶ Chap. 17, “Regional Studies in Non-Western Countries, and the Case of SACMEQ”). Examples are the Pacific Island Literacy and Numeracy Assessment (PILNA), the South East Asia Primary Learning Metric (SEA-PLM; see Ahmed et al. in ▶ Chap. 7, “The Role of International Large-Scale Assessments (ILSAs) in Economically Developing Countries” of this handbook), the Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ; see also Howie in ▶ Chap. 17, “Regional Studies in
Non-Western Countries, and the Case of SACMEQ” of this handbook) and PASEC, LLLCE and SEA-PLM (UNICEF & SEAMEO, 2019).

Theoretical Frameworks and Assessed Domains in ILSA

Comprehensive Frameworks in ILSA

Not many national assessments can compete with the comprehensiveness of information gathered by ILSAs – regional or international – implemented in a highly standardized manner with representative samples, enabling a broad understanding of various areas related to education. Nonetheless, the interpretation of the results has its limits, not only for national evidence-based policy, but also when it comes to comparisons across countries. A closer look into historical documents written by the developers of ILSA reveals that country ranking was not their aim (Stancel-Piątak & Schwippert, ▶ Chap. 8, “Comprehensive Frameworks of School Learning in ILSAs”). Instead, the goal was to compare “series of environments in which human beings learn,” while “respecting the national context in which education takes place” (Husén, 1967, p. 27). The focus on league tables and on policy-oriented reporting is a relatively recent development that has drawn immense criticism of ILSA from many scientists, especially those concerned with in-depth views on student learning, and from practitioners who perceive ILSAs as an intrusion. Considering the increasingly prominent role of ILSA in education, sound theoretical foundations of the assessments have become critical for providing high-quality research characterized by reliable and valid measures and theory-grounded interpretations. A closer look into the overall study designs reveals that these generally rely on a common understanding of structures and processes as overarching characteristics of education systems, despite differences in specific research and policy foci. Education systems are commonly perceived and analyzed from a global perspective, considering their local and national or regional context. Over the years, theoretical models have become increasingly complex and comprehensive, drawing on early developments in the 1960s when the idea of the opportunity to learn (OTL) was first formulated. The input-process-output paradigm, which played an important role in early ILSAs, has evolved to take on more of a circular nature involving reciprocal relationships. Drawing on economic and system theories, a comprehensive and elaborate framework describing learning as a dynamic process within the school system – such as, for instance, the dynamic model developed by Creemers and Kyriakides (2008) – has emerged and has been implemented to varying extents in several ILSAs (Stancel-Piątak and Schwippert, ▶ Chap. 8, “Comprehensive Frameworks of School Learning in ILSAs”). However, ILSA still faces substantial challenges related to the assessment of dynamic processes, to limitations of content coverage, and to insufficient inclusion of significant themes and topics. These challenges follow from the holistic view of student learning as defined, for instance, in the dynamic model of educational effectiveness (Kyriakides et al., ▶ Chap. 12, “Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness”). While the assessment of dynamic processes requires longitudinal data, the comprehensiveness of the assessments can be improved by implementing, for instance, rotated designs and cyclical repetitions of themes. Theoretical developments occur within the specific ILSAs rather independently of each other. The great increase in the number of studies and assessed topics over the years, however, raises the need to develop overarching approaches and research agendas to identify specific research gaps and to avoid over-testing of certain populations (Stancel-Piątak and Schwippert, ▶ Chap. 8, “Comprehensive Frameworks of School Learning in ILSAs”). In this context Kyriakides et al. (▶ Chap. 12, “Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness”) stress the commonalities between ILSAs’ conceptual foundations and Educational Effectiveness Research (EER). This integration extends the boundaries of both: ILSAs, to incorporate the dynamic conceptualization of the process of schooling, and EER, to embrace not only the question “what works in education and why” but also “to finding out under which conditions and for whom these factors can promote different types of student learning outcomes” (Kyriakides et al., ▶ Chap. 12, “Using ILSAs to Promote Quality and Equity in Education: The Contribution of the Dynamic Model of Educational Effectiveness”).

Assessed Domains and Content Coverage

For future ILSAs, it is desirable to extend theoretical underpinnings beyond the frameworks and results of previous ILSAs, in particular with respect to the inclusion of new content. This pertains, among other things, to the expansion of the assessed achievement domains, as pointed out by Mertes (▶ Chap. 13, “Overview of ILSAs and Aspects of Data Reuse”). While mathematics, reading, and science as well as civics and citizenship and computer and information literacy have a longer research history, some of the more recent domains, comprising financial literacy, (collaborative) problem solving, computational thinking, and global competence, are less well described in ILSA frameworks. Besides the recently implemented regional extensions, possible future developments pertain, for instance, to foreign languages, arts, and topics related to vocational schooling (Mertes, ▶ Chap. 13, “Overview of ILSAs and Aspects of Data Reuse”). Gjærum et al. present in their chapter (▶ Chap. 20, “ILSA in Arts Education: The Effect of Drama on Competences”) a reliable and valid tool (Drama Improves Lisbon Key Competences in Education, DICE) enabling the assessment of drama skills on a large scale as key competences promoted by the European Commission (2019). An additional area of development pertains to the conceptualization of cognitive outcomes in addition to content domains. While content domains have been the focus of many ILSAs over the past decades and are in many respects sophisticated and well-developed, the assessment of cognitive domains is more challenging. Thus, theoretical conceptualization and definition become even more crucial in this context (Leung and Pei, ▶ Chap. 9, “Assessing Cognitive Outcomes of Schooling”). Similarly, the conceptualization and measurement of socioeconomic inequality remains a challenging research area in ILSAs, in particular as socioeconomic gaps are a persistent finding. The comparative perspective of ILSAs might be advantageous for exploring inequality mechanisms that often remain incompletely analyzed within a single country. However, as pointed out by Strietholt and Strello (▶ Chap. 10, “Socioeconomic Inequality in Achievement”), the added value of international comparisons does not come without a price. Studies partly differ in the indicators they use to assess the respondents’ socioeconomic status, making cross-study comparisons challenging. International classifications suffer from a lack of cross-cultural validity, and the indicators are subject to time-related changes (e.g., home possessions, or the importance and economic gains that come with a certain level of education). Achieving a balance between continuity and change, ensuring comparability over time and making necessary adjustments remains a constant challenge for research on socioeconomic inequality (Strietholt and Strello, ▶ Chap. 10, “Socioeconomic Inequality in Achievement”). Another area of interest pertains to the content coverage achieved with ILSA assessments and the representativeness of these in terms of opportunity to learn (OTL). When relating OTL to student outcomes, it seems that student reports of content covered provide the most reliable information, at least as long as they can be assumed to be unconfounded with quality of instruction, attitudes, aptitudes, and prior knowledge, as pointed out by Luyten and Scheerens (▶ Chap. 11, “Measures of Opportunity to Learn Mathematics in PISA and TIMSS: Can We Be Sure that They Measure What They Are Supposed to Measure?”). The authors state, however, that differences between countries in the reference frames used for constructing OTL measures (national standards, taxonomies of educational objectives, or assessed content) play an important role and might limit the cross-national and cross-study comparability of assessment measures. Blömeke et al. (▶ Chap. 22, “Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations”) remind us that, for the sake of credibility, the choice of assessment domains in ILSAs should be guided by the importance and relevance of the domains, and not by whether they are easy to assess or not. While this is an important claim, ILSAs’ inability to assess everything regarded as important is an obvious limitation, especially considering the number and heterogeneity of participating populations. One possible response to this challenge could be to supply country-specific information with evidence unique to the national contexts, also implementing other types of assessments, such as qualitative or longitudinal studies. According to Blömeke et al. (▶ Chap. 22, “Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations”), this would require careful and methodologically sound linking of national and international assessments. Recently, increasing interest in linking ILSAs to national assessments has led to various developments of this kind, of which some examples are given in this handbook (Greger et al., ▶ Chap. 34, “Extending the ILSA Study Design to a Longitudinal Design”; Wendt and Schwippert, ▶ Chap. 35, “Extending International Large-Scale Assessment Instruments to National Needs: Germany’s Approach to TIMSS and PIRLS”; García-Crespo et al., ▶ Chap. 36, “Extending the Sample to Evaluate a National Bilingual Program”).
While national extensions and linkage between assessments represent interesting and valid ideas for future developments, the general trend toward establishing universal scales for large-scale assessments is described – to put it mildly – as “the most ambitious research goal in the field of ILSAs,” driven by policy needs rather than by scientific or practical interest (Klieme, ▶ Chap. 19, “Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-Validating Findings”). Judging from many early ILSA documents, it does not seem to have been on the scientific agenda of ILSA’s originators either (Stancel-Piątak and Schwippert, ▶ Chap. 8, “Comprehensive Frameworks of School Learning in ILSAs”).

Populations and Design in ILSAs

Populations

As far as investigated populations are concerned, a clear focus has been on students from the early years and onward. Teachers and/or principals are treated as secondary units of analysis, which has an impact on the generalization of the results from existing ILSAs. Teacher analysis typically pertains to the “teachers of the student population” assessed by each ILSA, except for those studies which explicitly focus on teachers. The most straightforward way to circumvent this would be to conduct studies with teachers as the primary sampling unit, including teachers of various subjects, as is the case with TALIS (Teaching and Learning International Survey). Another property of ILSAs is the focus on a few selected age groups or school grades. Expanding populations beyond fourth and eighth graders (or 15-year-olds) could be one possible future extension (Mertes, ▶ Chap. 13, “Overview of ILSAs and Aspects of Data Reuse”). The first investigations of preschool education have been implemented within the scope of the OECD Starting Strong Survey (Sim et al., 2019), but the assessment of the youngest children still remains a challenge. Another example of an ILSA that contributed to extending the scope of populations was the IEA’s TEDS-M, which assessed the pedagogical knowledge of future teachers. Blömeke sees its contribution, among other things, in the development of “crucial conceptual ideas (…) that influenced educational research and thus educational theory in general” (Blömeke, ▶ Chap. 15, “IEA’s Teacher Education and Development Study in Mathematics (TEDS-M)”). The study included constructs that are challenging to assess, such as mathematics pedagogical content knowledge or general pedagogical knowledge. Unfortunately, except for a national extension in Germany where the OTL of future German, English, and mathematics teachers was investigated (Stancel-Piątak et al., 2013), the study has not been followed up.

Quantitative and Qualitative Approaches in ILSA

While ILSAs predominantly use quantitative research designs, qualitative components could be added to investigate specific topics for which quantitative assessment does not provide sufficient insight. As pointed out by Klette (▶ Chap. 18, “The Use of Video Capturing in International Large-Scale Assessment Studies: Methodological and Theoretical Considerations”), pioneering work was conducted by Stigler and Hiebert (1997, 1999) already in the first and second Trends in International Mathematics and Science Study (TIMSS) video studies. In these studies, Stigler and Hiebert used video clips from three and seven countries, respectively, to study teachers’ instructional repertoires. While video studies are still rare in international large-scale assessment research, general developments in video design and technology have contributed to a rapid increase in studies of classroom teaching and learning. As Klette observes, this has paved the way for a new generation of international comparative studies using videos to measure and understand teaching quality across contexts.

Methods in ILSA

Generalizability and Comparability of Results

Roughly, methods in ILSA can be organized around methods of design, implementation, and analysis. The generalizability of results is an overarching theme across all of these. In general, independent of the specific sampling approach, all ILSA data stem from a sample that was not selected fully at random. Usually, stratification and multistage cluster sampling are applied to reduce the costs and administrative burden and to achieve a representative sample from the assessed population. Thus, as pointed out by Meinck and Vandenplas (▶ Chap. 23, “Sampling Design in ILSA”) and by Gonzalez (▶ Chap. 28, “Secondary Analysis of Large-Scale Assessment Databases”), specific methods of analysis (whether secondary or primary) have to be applied to achieve unbiased population estimates. A major concern is the comparability of results across countries and cycles, which can only be guaranteed when errors are random and small, as for instance when the precision of the population estimates is high. To ensure this, several quality standards must be put in place and applied at every step of the assessment. ILSA can only maintain its high quality through attention to the definitions of the target population, sampling procedures, data collection and weighting, and suitable (weighted!) analysis. Over the years, several useful methods were established and researched in the context of ILSAs. Particular attention has been on measurement issues, with a special focus on the assessment part of the surveys, which is challenged by the increasing number of participating countries with heterogeneous cultural and societal structures. In reaction to these developments, cross-cultural comparability has recently become more of a concern (Lyons-Thomas et al., ▶ Chap. 25, “Implementing ILSAs”). In addition, the gradual shift from paper-based to computer-based assessment, accompanied by increased numbers of items and topics, fostered the implementation of measures to achieve fit between populations, assessment modes, and the measurement. In this context, Rutkowski and Murkowski (▶ Chap. 24, “Designing Measurement for All Students in ILSAs”) point out that there have also been promising attempts to implement item rotation not only for the assessment part but also for the background questionnaires. Another response to these challenges could be the implementation of a multistage testing design (Rutkowski, Rutkowski and Valdivia, ▶ Chap. 27, “Multistage Test Design Considerations in International Large-Scale Assessments of Educational Achievement”). Particular interest pertains to the cross-cultural comparability of self-reported data, a topic that came into focus in the last decade almost as a side effect of the increasing number of applications based on latent (indirectly measurable) constructs. Currently, this topic is predominantly discussed under the label of “response bias,” in which country differences in scale performance are perceived as an undesirable factor that should be eliminated to enable mean comparisons and country rankings (Ikeda et al., ▶ Chap. 16, “OECD Studies and the Case of PISA, PIAAC, and TALIS”; He et al., ▶ Chap. 31, “Cross-Cultural Comparability of Latent Constructs in ILSAs”). This perspective is relevant for policy but less valid from the scientific point of view. While the former is interested in fully standardized data aligned between countries to the greatest extent possible, focusing most often on differences of levels (mean comparisons), the latter investigates differences between countries to allow for a deeper understanding of processes within each education system. Cross-country differences in latent constructs may – from the scientific point of view – be interesting findings that align well with theoretical expectations. Moreover, differences in the meaning of the construct do not necessarily disturb the analysis, as they can be considered while interpreting the results. Nevertheless, it might be challenging, as the benefit of this more comprehensive picture does not come without a price. What might be perceived as a disadvantage is that the one-dimensional, straightforward world view with “easy to communicate” solutions has to be abandoned in favor of a more comprehensive and nuanced view, which leaves space for more subjective and heterogeneous interpretations. Apart from the fact that differences in latent constructs are themselves informative for exploring student learning, there is no real need to operate with universal constructs. The understanding of the specific conditions that work well in each analyzed system is much more crucial for success in education.
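To make the point about suitable weighted analysis concrete, the following is a minimal sketch of how a population mean and its standard error might be estimated from ILSA-style data using a total student weight and a set of replicate weights. The column names (achievement, total_weight, rep_weight_1 … rep_weight_G) and the variance factor are hypothetical placeholders; the correct factor depends on the replication method a given study documents (e.g., a jackknife scheme or Fay’s balanced repeated replication).

```python
import numpy as np
import pandas as pd

def weighted_mean(values: pd.Series, weights: pd.Series) -> float:
    """Design-weighted mean of a survey variable."""
    return float(np.average(values, weights=weights))

def replicate_se(df: pd.DataFrame, var: str, total_w: str,
                 rep_cols: list[str], var_factor: float) -> float:
    """Standard error via replication: re-estimate the statistic with each
    replicate weight and combine the squared deviations from the full-sample
    estimate. var_factor is a placeholder; it depends on the study's
    replication scheme and must be taken from the technical documentation."""
    full = weighted_mean(df[var], df[total_w])
    deviations = [(weighted_mean(df[var], df[rc]) - full) ** 2 for rc in rep_cols]
    return float(np.sqrt(var_factor * np.sum(deviations)))

# Hypothetical usage:
# df = pd.read_csv("student_file.csv")
# rep_cols = [f"rep_weight_{g}" for g in range(1, 81)]
# est = weighted_mean(df["achievement"], df["total_weight"])
# se = replicate_se(df, "achievement", "total_weight", rep_cols, var_factor=0.05)
```

Secondary analyses that skip the weights or compute naive standard errors will generally understate uncertainty, which is exactly the kind of misuse several chapters warn against.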

Analytical Potential of ILSA Data for Causal Analysis

Methods of analysis have always been a topic of lively research, constantly triggering new developments. While some methods of analysis have passed the test of time, such as descriptive techniques, alternative complex methods were proposed in response to the criticism of ILSA analysis (e.g., multiple regression to account for confounders) and to better suit the holistic approach (e.g., multilevel models; Scherer, ▶ Chap. 32, “Analyzing International Large-Scale Assessment Data with a Hierarchical Approach”). To enhance the analytical potential of ILSA data by increasing its accessibility, comprehensive documentation is provided with each ILSA cycle. More recently, the first central ILSA platform, the “Gateway,” was established to support researchers interested in conducting secondary analyses. Important information about all major ILSAs is easily accessible via this platform (Mertes, ▶ Chap. 13, “Overview of ILSAs and Aspects of Data Reuse”). A major limitation discussed in this context is the inability to draw causal conclusions from ILSA data. The cross-sectional design is perceived as an insurmountable barrier discouraging any efforts to explore questions related to mechanisms and processes. Only recently have education scientists been discovering methods that are well established in other disciplines (e.g., econometrics) and allow for more robust causal statements despite the cross-sectional structure. As discussed by Gustafsson and Nilsen (▶ Chap. 29, “Methods of Causal Analysis with ILSA Data”), the demand to avoid causal interpretations of analyses with ILSA data does not solve the problem but rather pushes it out of sight. The authors criticize that, through the use of seemingly noncausal formulations with causal claims implicitly and subtly built into the text (e.g., “teachers of students from disadvantaged family backgrounds are less satisfied with their work environment”), the take-home message remains causal in nature. The most feasible and methodologically “suitable” solution to this dilemma would be to implement the necessary modifications to the study design. Blömeke et al. (▶ Chap. 22, “Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations”) suggest that an effort to implement longitudinal designs could be feasible (although costly) if the OECD and IEA were to coordinate their efforts. While this remains a challenge for several reasons (mainly legal and political), some countries have attached national longitudinal extensions to existing ILSAs, as discussed in “Potentials and Methods of Linking ILSAs to National Data.” Nonetheless, the increasing number of cycles for each ILSA supports trend analyses, which are about to become a major aspect of ILSAs (Hastedt and Sibberns, ▶ Chap. 21, “Future Directions, Recommendations, and Potential Developments of ILSA”), hence allowing for causal interpretations, at least at the country level.
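As an illustration of the kind of country-level design exploited in such analyses, the sketch below uses two cycles of aggregated country data and regresses the change in mean achievement on the change in a hypothetical policy variable; first-differencing removes stable, unobserved country characteristics that would otherwise confound a simple cross-sectional comparison. The data file and variable names are hypothetical, and this is only a schematic stand-in for the approaches discussed by Gustafsson and Nilsen.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format file: one row per country and cycle with
# country-level means (e.g., achievement and instructional_hours).
panel = pd.read_csv("country_by_cycle.csv")

# First differences between adjacent cycles within each country remove
# time-invariant country characteristics (culture, wealth, etc.).
panel = panel.sort_values(["country", "cycle"])
diffs = panel.groupby("country")[["achievement", "instructional_hours"]].diff().dropna()

# Regress the change in achievement on the change in the predictor.
model = smf.ols("achievement ~ instructional_hours", data=diffs).fit()
print(model.summary())
```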

Trend Analysis with ILSA Data

Perhaps the most prominent challenge with every new ILSA cycle is the tension between ensuring measures that are comparable between cycles on the one hand, and the need for modifications to keep ILSAs updated with societal changes and changing policy interests on the other (e.g., Ikeda et al., ▶ Chap. 16, “OECD Studies and the Case of PISA, PIAAC, and TALIS”). In the assessment domains, some items are regularly replaced by new ones, and an IRT-based model is then used to link the assessments across cycles. While this method has proved to work quite well for the assessment, it does not solve the issue for the background questionnaires, where far fewer items are administered per domain or trait (usually between four and six). Any change to an item in these scales therefore has a more radical effect than in the case of the assessment items. Hooper (▶ Chap. 26, “Dilemmas in Developing Context Questionnaires for International Large-Scale Assessments”) recommends avoiding trend measurement of background constructs when the substantive wording of items has changed, or at least providing strong psychometric evidence to justify measuring trends when items are added, changed, or deleted from scales or indices. A possible solution proposed by Mullis and Mullis (▶ Chap. 14, “IEA’s TIMSS and PIRLS: Measuring Long-Term Trends in Student Achievement”) could be to “implement new promising ideas (…) on a small scale (…) together with the ongoing assessment.” Another solution would be to administer the new scale together with the old one within the same cycle, randomly assigning each of the two scales to half of the sample in each country. This way, differences between the constructs could be evaluated directly in the respective cycle and taken into account in reporting. Another way to monitor population trends was proposed by Kaplan and Jude (▶ Chap. 30, “Trend Analysis with International Large-Scale Assessments”). The authors discuss the potential of a model-based forecasting approach to provide policy information obtained using Bayesian prediction models. Borrowing from a long tradition of work on demographic forecasting, the authors present a predictive, model-based approach that allows not only for the presentation of trends but also for explicitly forecasting changes in student outcomes over time. From a comparative perspective, Klieme provides insights into the matching of empirical measures and into how to link data from separate ILSAs, both for trend analysis purposes and for linking international to national data (▶ Chap. 19, “Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings”). According to him, although ILSAs seem to agree roughly on country rankings, they do not seem to be well aligned between different cycles for trend purposes. Aside from the cultural specifics of countries, “big-data” analysis will have to consider various aspects of conceptualization, measurement, and data structure to provide robust findings, of which the modular approach for questionnaire design in PISA may serve as an example. Joint efforts of the IEA, the OECD, and UNESCO may ease the combination of data from ILSAs (Klieme, ▶ Chap. 19, “Comparison of Studies: Comparing Design and Constructs, Aligning Measures, Integrating Data, Cross-validating Findings”).
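The random split-half administration described above is straightforward to set up; the sketch below illustrates the idea of assigning the old and the new scale version at random within each country and then comparing the resulting score distributions. The sample sizes, country codes, and column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 200 students in each of 30 countries.
students = pd.DataFrame({
    "student_id": range(6000),
    "country": np.repeat([f"C{i:02d}" for i in range(30)], 200),
})

# Shuffle the rows, then alternate "old"/"new" within each country so that
# each country's sample is randomly split in half between the two versions.
students = students.sample(frac=1.0, random_state=2024)
students["scale_version"] = np.where(
    students.groupby("country").cumcount() % 2 == 0, "old", "new"
)
students = students.sort_values("student_id").reset_index(drop=True)

# After both versions are scored, a per-country comparison of the half-samples
# (e.g., of mean scores) shows how much the revision shifts the construct:
# scored.groupby(["country", "scale_version"])["score"].agg(["mean", "std"])
```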

Log-Data

With computer-based assessment becoming the new standard in ILSA, the potential to use log data has recently emerged. Rich information on a range of actions performed by the respondents in the computer testing application is available (Hastedt and Sibberns, ▶ Chap. 21, “Future Directions, Recommendations, and Potential Developments of ILSA”). As presented by Costa and Netto (▶ Chap. 33, “Process Data Analysis in ILSAs”), process data provide useful insights into the respondent’s cognitive processes and can potentially become a relevant element in the scoring process of an assessment. They also point out that process data can validate test score interpretations. In response to the complex and hierarchical structure of education systems, Costa and Netto propose an ecological framework for the analysis of process data that offers a visual mapping of the analysis strategies for a deep and comprehensive exploration of process data from ILSAs (▶ Chap. 33, “Process Data Analysis in ILSAs”). The authors point out that despite the promising potential of log data, further investigations are needed to enhance its use. In particular, they point to the lack of a substantial theoretical foundation and/or to software limitations, concluding that modern testing software is still unable to identify which user actions are relevant when an examinee is interacting with an item. Considering the increased attention log data has received in recent years, it can be speculated that developments in this area will increase in the next few years.
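As a simple illustration of what working with such process data can look like, the sketch below derives time-on-item and an action count from a hypothetical event log with one timestamped row per recorded user action; the file layout and column names are invented for the example and do not correspond to any particular study’s log format.

```python
import pandas as pd

# Hypothetical event log: one row per logged action, with timestamps.
log = pd.read_csv(
    "events.csv",
    parse_dates=["timestamp"],
    usecols=["student_id", "item_id", "timestamp", "action"],
)
log = log.sort_values(["student_id", "item_id", "timestamp"])

# Aggregate raw events into simple process indicators per student and item:
# time on item (first to last event) and number of recorded actions.
indicators = (
    log.groupby(["student_id", "item_id"])
    .agg(
        time_on_item=("timestamp", lambda t: (t.max() - t.min()).total_seconds()),
        n_actions=("action", "size"),
    )
    .reset_index()
)

# These indicators can then be merged with response data, e.g., to explore
# whether very short times on item coincide with omitted or incorrect answers.
```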

Findings

The overarching aim of the Findings part of the handbook is to provide reviews of the most common fields within the school, classroom, and student levels of the school system. Most authors have reviewed findings from research based on ILSA data and have discussed the findings in light of meta-studies, systematic reviews, longitudinal studies, and other studies not using ILSA data. Such reviews are important for three reasons. First, the number of secondary analyses of ILSA data has grown dramatically during the last decades (Hopfenbeck et al., 2018). There is, hence, a need to synthesize findings from these studies. As many chapters in this handbook have described, the quality of the studies has increased over time (see, e.g., Nilsen and Teig, ▶ Chap. 38, “A Systematic Review of Studies Investigating the Relationships Between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS”; Klieme and Nilsen, ▶ Chap. 39, “Teaching Quality and Student Outcomes in TIMSS and PISA”), especially due to the growing availability of more advanced tools and techniques in research, and better data. Second, even though many existing systematic reviews and meta-analyses include cross-sectional studies, they rarely include studies using ILSA data. Third, the number of studies in educational research in general has increased, and educational policy, research, and practice are in need of syntheses within different fields.

Schools, Principals, and Institutions

This first section, “Schools, principals, and institutions,” addresses the most important area of research at the school level, namely school climate. Nilsen and Teig (▶ Chap. 38, “A Systematic Review of Studies Investigating the Relationships between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS”) performed a systematic review of research on school climate. School climate is here understood in its broadest sense. The authors build their theoretical framework on Wang and Degol (2015) and Thapa et al. (2013). Figure 1, taken from the chapter by Nilsen and Teig, illustrates the framework, according to which school climate includes four main dimensions, each of which contains several aspects. The first dimension includes all academic activities at school, including leadership, teaching and learning, and professional development. The systematic review included studies that used TIMSS, PISA, and/or PIRLS data to investigate relations between school climate and student outcomes (cognitive and affective). First, the authors examined how school climate is assessed in these ILSAs. They then used the systematic review to examine what characterized the included studies in terms of school climate dimensions and student outcomes, data and samples, and methodological appropriateness. Finally, they examined the patterns of findings with regard to the relationships between school climate dimensions and student outcomes.


Fig. 1 Dimensions of school climate (taken from the chapter by Nilsen & Teig in this handbook): Academic (leadership, teaching and learning, professional development); Community (partnership, quality of relationships, connectedness, respect for diversity); Safety (social/emotional, discipline and order, physical); Institutional environment (environmental adequacy, structural organization, availability of resources)

As mentioned previously, the goal of the present chapter is not to provide extensive summaries of all the chapters of the handbook. Rather, we point here to some affordances and limitations that are highlighted by the authors and identified across several chapters. With regard to affordances, the two most important ones are, first, the large overlaps between the most prominent theoretical frameworks in the field and the contextual frameworks of ILSA, and, second, the alignment between the findings from secondary analyses of ILSA data and the findings from the larger field of research (e.g., meta-studies, longitudinal studies). Like the other chapters in this handbook presenting systematic reviews, Nilsen and Teig identified an increasing number of publications using ILSA data over time, indicating a need to synthesize findings. Teaching and learning was the aspect of school climate that completely dominated over other aspects. Moreover, cognitive outcomes dominated over affective outcomes, samples from lower secondary school dominated over primary school, mathematics dominated over science and reading, and studies on high-performing countries dominated over studies on low-performing countries. The focus on mathematics, cognitive outcomes, and lower secondary school has been observed also in other systematic reviews and reviews in this handbook (see, e.g., Eklöf, ▶ Chap. 46, “Student Motivation and Self-beliefs”; Klieme and Nilsen, ▶ Chap. 39, “Teaching Quality and Student Outcomes in TIMSS and PISA”). A limitation of secondary studies pointed out in the school climate chapter and other chapters in the handbook is their methodological flaws. More than half of the studies included in the systematic review in the school climate chapter either did not use the plausible values correctly or failed to mention them. Moreover, a great many studies failed to use the weights correctly or did not take the hierarchical, clustered design into account. To summarize, the chapter points to the affordances of ILSAs in measuring all dimensions and aspects of school climate that the research field emphasizes as the most important. The findings from the secondary analyses of ILSAs are mostly aligned with findings from the overall field. However, the field is in need of studies focusing also on affective outcomes, of studies in subject domains other than mathematics, and of studies in developing countries. Moreover, there is a need for studies using data from primary school.
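To make the recurring point about plausible values concrete, the sketch below shows the standard way of combining analyses across plausible values: the statistic is estimated once per plausible value and the results are pooled, with the between-imputation variance added to the sampling variance. The column names and the number of plausible values are hypothetical placeholders; in practice, the sampling variance of each estimate would itself come from the weighted, replicate-based procedure documented by the study.

```python
import numpy as np

def pool_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value estimates using Rubin's rules.

    estimates: one statistic per plausible value (e.g., a weighted mean).
    sampling_variances: the squared standard error of each estimate.
    """
    estimates = np.asarray(estimates, dtype=float)
    sampling_variances = np.asarray(sampling_variances, dtype=float)
    m = len(estimates)

    point = estimates.mean()                # pooled point estimate
    within = sampling_variances.mean()      # average sampling variance
    between = estimates.var(ddof=1)         # variance across plausible values
    total = within + (1 + 1 / m) * between  # total error variance

    return point, float(np.sqrt(total))

# Hypothetical usage with five plausible values and the helper functions
# sketched earlier (weighted_mean, replicate_se):
# means = [weighted_mean(df[f"pv{i}_math"], df["total_weight"]) for i in range(1, 6)]
# ses = [replicate_se(df, f"pv{i}_math", "total_weight", rep_cols, 0.05) for i in range(1, 6)]
# estimate, se = pool_plausible_values(means, [s ** 2 for s in ses])
```

Analyzing only the first plausible value, or averaging them into a single score before analysis, ignores the imputation uncertainty that this pooling step is meant to capture.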

Classrooms, Teachers, and Curricula

This section includes five chapters examining factors pertaining to the class level. The chapters present reviews of studies using ILSA data within the areas of teachers’ instructional quality, inquiry, teacher competence and professional development, teachers’ beliefs, and homework. The inclusion of these chapters is grounded in previous research; together they form a coherent picture of teachers and teaching (see Fig. 2). There is consensus that teachers bring certain characteristics and competences into the classroom, and that this affects their instruction or behavior in the classroom (Baumert et al., 2010; Blömeke et al., 2020; Darling-Hammond, 2000; Klieme et al., 2009). Teachers’ characteristics may include, for instance, their beliefs, as described in the review provided in the chapter by Price (▶ Chap. 42, “Teachers’ Beliefs”). A review of teacher competences and a thorough theoretical framework is provided in the chapter by Jentsch and König (▶ Chap. 41, “Teacher Competence and Professional Development”). While teacher competences and characteristics usually affect student outcomes via their instruction, there are sometimes direct effects on student outcomes (Nilsen & Gustafsson, 2016), as shown in Fig. 2. Teachers’ instructional quality is known to influence student outcomes (e.g., Klieme et al., 2009; Pianta & Hamre, 2009), as is reviewed and described in the

Fig. 2 The conceptual connection between the chapters: what the teachers bring to the classroom (teacher competence and teacher characteristics, e.g., beliefs) feeds into teachers’ practices and instruction (instructional quality, including inquiry, and homework), which in turn relate to student outcomes


chapter by Klieme and Nilsen (▶ Chap. 39, “Teaching Quality and Student Outcomes in TIMSS and PISA”). Inquiry may also be considered a part of teachers’ instructional quality and is therefore included in Fig. 2 under “Teachers’ practices and instruction.” A systematic review of inquiry is provided in the chapter by Teig (▶ Chap. 40, “Inquiry in Science Education”). Instructional quality describes what happens in the classroom, more specifically teacher behavior, and hence does not include homework (Charalambous et al., 2021). However, homework is part of teaching and learning and is hence included in Fig. 2 under “Teachers’ practices and instruction” and reviewed in the chapter by Fernández-Alonso and Muñiz (▶ Chap. 43, “Homework: Facts and Fiction”). The following describes how several findings from the five chapters overlap. First, the findings of the reviews generally align with those of the wider field, as found in meta-studies, other reviews, longitudinal studies, and other studies not using ILSA data. For instance, Teig (▶ Chap. 40, “Inquiry in Science Education”) found that inquiry as an instructional approach was positively related to student affective outcomes, while the relations to cognitive outcomes were mixed. Klieme and Nilsen (▶ Chap. 39, “Teaching Quality and Student Outcomes in TIMSS and PISA”) also found that findings on the relations between different dimensions of instructional quality and student outcomes for the most part aligned with findings from the field (Fauth et al., 2014; Praetorius et al., 2018; Seidel & Shavelson, 2007). Price (▶ Chap. 42, “Teachers’ Beliefs”) found a strong alignment between the field and studies performing secondary analyses of ILSA data with regard to relations between teacher beliefs (e.g., self-efficacy, job satisfaction), their instruction, and student outcomes, and states: “These ILSA studies clearly evidence that teachers’ beliefs matter for teaching and learning.” At the same time, there were some unexpected negative relations between predictors and outcomes in some of the chapters. For instance, Klieme and Nilsen (▶ Chap. 39, “Teaching Quality and Student Outcomes in TIMSS and PISA”) found negative relations between an aspect of instructional quality known as teacher support and student achievement. The authors of the chapters have explained this and other negative relations in terms of reverse causality due to the cross-sectional design of ILSA data. For instance, low-achieving students may report that they receive more support from teachers. Other unexpected negative relations are explained by curvilinear relations. For instance, Teig (▶ Chap. 40, “Inquiry in Science Education”) points out that inquiry follows a curvilinear relation with achievement; in other words, more is not always better (Teig et al., 2018). Similarly, Fernández-Alonso and Muñiz (▶ Chap. 43, “Homework: Facts and Fiction”) point out that homework follows a curvilinear path and that high achievement is associated with moderate amounts of daily homework. The chapters also identified large variations across countries, indicating that teachers and their practices and instruction are to a large extent dependent on the cultural context. For instance, Jentsch and König (▶ Chap. 41, “Teacher Competence and Professional Development”) found a tendency for higher teacher qualifications to be accompanied by higher student achievement at secondary school, especially in mathematics, but these findings varied across countries. The reviews of studies using ILSA data all point to the same issues: many of the secondary studies have methodological flaws, as they fail to take weights, plausible values, and the hierarchical cluster design of ILSAs into account. Moreover, the secondary studies are performed with a great variety of methods and at different levels (student, class, school, and national level). For instance, Fernández-Alonso and Muñiz (▶ Chap. 43, “Homework: Facts and Fiction”) found that homework was analyzed at different levels, something that produced very different results. While studies performing the analyses at the class level mostly found positive associations between time spent on homework and achievement, those performed at the student level found nonsignificant or negative relations. One possible explanation for these seemingly contradictory results is that student-level analyses are sensitive to reverse causality effects, according to which students who are lagging behind are given more homework, while positive class-level effects may be due to teachers using homework as part of their planned teaching. Another issue with the secondary analyses is that they conceptualize and operationalize the concepts in different ways. These studies often use theoretical frameworks other than those provided by the ILSAs to pick items. Furthermore, some use single items to reflect a latent trait, while others create constructs and make up names for the constructs without grounding them in theory. This makes comparisons across studies challenging. To conclude, the findings of the five chapters will contribute to the field, policy, and practice, especially because most findings are aligned with other studies not using ILSA data, longitudinal studies, and meta-studies. Longitudinal extensions of ILSAs would solve many of the validity issues described above, such as the unexpected negative relations. There is a need for more careful and coherent use of conceptualizations and operationalizations, and more correct use of methodology, in studies using ILSA data. At the same time, these challenges also pertain to studies not using ILSA data, for instance within the field of teaching (see, e.g., Charalambous et al., 2021). Future systematic reviews or meta-studies on ILSA data may gain from working with the raw data (Scherer et al., 2021) rather than relying on secondary studies with a great many conceptualizations and operationalizations, and with analyses performed at different levels and with different methods.

Students, Competences, and Dispositions

Most countries include other outcomes in their curricula besides cognitive outcomes. Affective outcomes, such as student motivation and well-being, are also considered outcomes. In addition, digital competences are increasingly considered important outcomes. This section includes four chapters reviewing and investigating four outcomes: cognitive outcomes in mathematics, reading, and science; digital competences; student motivation and beliefs; and student well-being. The first two of these chapters connect their studies to UN Sustainable Development Goal 4 (SDG4). In the first chapter, Mullis and Kelly (▶ Chap. 44, “International Achievement in Mathematics, Science, and Reading”) address an important topic of SDG4: that children, boys and girls alike, should reach a minimum proficiency level in mathematics and reading at the end of primary and secondary education. They do this by investigating to what degree countries reach minimum proficiency in these subjects, by gender, and also to what degree countries succeed in educating students with high proficiency. The benchmark levels (proficiency levels) in TIMSS and PIRLS provide the most valid and reliable proficiency levels of the ILSAs; the large number of items included in the test and the methodology for the cut-off points surpass those of other ILSAs (Olsen & Nilsen, 2017). In the second chapter, Schulz, Fraillon, Ainley, and Duckworth (▶ Chap. 45, “Digital Competences: Computer and Information Literacy and Computational Thinking”) also address SDG4, more specifically its target that all students should be able to achieve basic ICT skills. They do this by investigating how digital competence has been conceptualized and assessed in ILSAs and by reviewing findings from the International Computer and Information Literacy Study (ICILS). The last two chapters address affective outcomes. Eklöf (▶ Chap. 46, “Student Motivation and Self-beliefs”) provides an extensive review of students’ motivation and self-beliefs and also describes how these have been measured in the ILSAs. The goal is to situate the research done on motivation using ILSA data within the larger achievement motivation research field. The last chapter, by Borgonovi (▶ Chap. 47, “Well-Being in International Large-Scale Assessments”), reviews students’ well-being, both in terms of how it has been measured in the ILSAs and how it is related to students’ cognitive outcomes. The findings from the four chapters point to affordances and limitations within these four areas: cognitive outcomes, digital competence, student motivation and self-beliefs, and student well-being. Mullis and Kelly (▶ Chap. 44, “International Achievement in Mathematics, Science, and Reading”) found that, in most countries, most students (about 90%) were able to reach a minimum proficiency level in reading, science, and mathematics in grade 4, but that in some countries with emerging economies, only a third of the students reached this level. They raise concerns for other developing countries not participating in ILSAs, as the pattern there is most likely similar. The pattern is similar in grade 8, and no progress was found from grade 4 to grade 8. While gender equality was found in mathematics and science, fewer boys than girls reached the minimum proficiency level in reading. Schulz, Fraillon, Ainley, and Duckworth (▶ Chap. 45, “Digital Competences: Computer and Information Literacy and Computational Thinking”) describe two main domains of digital competence. The first, computer and information literacy, is defined as “the ability to use computers to investigate, create, and communicate in order to participate effectively at home, at school, in the workplace, and in society” (Fraillon et al., 2013, p. 17). The second, computational thinking, is defined as the ability to use digital technologies for solving problems. The authors found large gender gaps: girls scored higher on computer and information literacy than boys, while boys demonstrated higher levels of computational thinking than girls.
There were large variations within and across countries for both domains, but SES was a strong predictor across all countries. The authors raise a concern with regard to pedagogical practice and educational policy and suggest increasing teacher expertise in ICT. Eklöf (▶ Chap. 46, “Student Motivation and Self-beliefs”) found promising results for student motivation and self-beliefs: results from research using ILSA data were aligned with those from the larger field. Self-beliefs like self-efficacy and self-concept showed positive associations with performance, while findings were more mixed for intrinsic motivation, and especially for extrinsic motivation. A concern is raised with regard to comparisons of student self-beliefs and motivation across countries, due to cultural differences that may result in different understandings of concepts and different response patterns across countries. Measurement invariance should be examined to evaluate whether constructs are comparable. Borgonovi (▶ Chap. 47, “Well-Being in International Large-Scale Assessments”) investigated five dimensions of well-being – cognitive, social, psychological, physical, and material – across the ILSAs. The author points out that these measures were scarce in early cycles of ILSAs, but, due to criticism, more recent cycles of ILSAs include more measures of well-being. The review showed strong associations between students’ social and psychological well-being and cognitive well-being. However, at the national level, the relation between psychological well-being and cognitive well-being follows an inverted U shape, so that students with the highest achievement report medium levels of psychological well-being. Students with an immigrant background are disadvantaged in terms of cognitive, material, and social and psychological well-being. Taken together, the four chapters raise concerns for developing countries in terms of both cognitive outcomes and digital competencies. Gender differences exist across most countries for reading, where girls outperform boys, and for computational thinking, where boys outperform girls. Affective outcomes are more challenging to compare across countries, as the measures may not be comparable across cultures. This could be the reason behind the curvilinear relation between well-being and achievement. Countries with high achievement may have a different response pattern and understanding of the concepts (Jia et al., 2017; Van de Vijver & Tanzer, 1997), although further research is needed to disentangle this.
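The call to examine measurement invariance is usually answered with multi-group confirmatory factor analysis and nested configural, metric, and scalar models in dedicated SEM software. As a lightweight, purely illustrative screening step, the sketch below fits a one-factor model separately per country and compares the loading patterns. All file, item, and country names are hypothetical, and similar loadings are at best a rough hint of comparability, not a formal invariance test.

```python
import pandas as pd
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: one row per student, a country code, and four
# questionnaire items assumed to measure a single motivation construct.
items = ["enjoy_learning", "look_forward", "interesting", "favourite_subject"]
df = pd.read_csv("motivation_items.csv")

# Fit a one-factor model separately per country and collect the loadings.
loadings = {}
for country, group in df.groupby("country"):
    fa = FactorAnalysis(n_components=1, random_state=0)
    fa.fit(group[items])
    loadings[country] = fa.components_[0]

comparison = pd.DataFrame(loadings, index=items).T
print(comparison.round(2))  # broadly similar loading patterns across countries
                            # are only a weak hint that the construct behaves
                            # similarly; formal testing requires multi-group CFA
```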

Equity and Diversity

This section addresses UN Sustainable Development Goal 4 (SDG4), which aims to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all (UNESCO, 2018). Due to the pandemic, during which children in most countries experienced school closures or digital instruction, research on educational inequality is more important than ever. It has been quite some time since meta-analyses in the field of educational inequality were performed (Sirin, 2005; White, 1982). Yet, the number of studies using ILSA data for secondary analyses has grown dramatically (e.g., Hopfenbeck et al., 2018; Nilsen and Teig, ▶ Chap. 38, “A Systematic Review of Studies Investigating the Relationships Between School Climate and Student Outcomes in TIMSS, PISA, and PIRLS”). ILSA data are well suited to investigate inequality due to the representative samples at the national level, the opportunity to compare findings across countries, and the high-quality measures that continue to improve due to feedback from the research community and the strict quality assurance procedures. It is thus high time that findings on educational inequality from the ILSAs are synthesized. The four chapters in this section do that and review different aspects of inequality: (1) gender inequality; (2) compositional effects in terms of dispersion and level of achievement; (3) equality across different age groups; and (4) the relation between SES and migration status and student achievement, and how organizational factors may intensify or compensate for educational inequality. We will first provide very brief summaries of the key findings of the four chapters, and then discuss these findings. Two of the chapters have performed systematic reviews of very broad fields and will naturally produce more findings. The first chapter, on gender inequality, is a systematic review by Rosén et al. The key findings show that, over almost half a century of ILSA findings, the female disadvantage has decreased or vanished, and, for some countries, turned into a female advantage. In reading, girls outperform boys, while in mathematics and science, there are small or no gender gaps. For ICT skills and skills within civics and citizenship, girls score higher, but boys have higher scores on computational thinking in most countries. Gender gaps increase with age, and boys show larger variability of test scores. The second chapter, by Rjosk (▶ Chap. 49, “Dispersion of Student Achievement and Classroom Composition”), reviews compositional effects through dispersion and achievement levels. With regard to dispersion, the key findings of the chapter are that results were mixed, and no conclusions could be drawn. However, for the aggregated achievement level, the findings showed that students achieve better but have lower self-concept and interest when surrounded by more able peers (at the class or school level). The third chapter, by García (▶ Chap. 50, “Perspectives on Equity: Inputs versus Outputs”), examined the relation between mother’s education (as a proxy for SES) and outcomes for different age groups using data from PISA 2012 and PIAAC 2012. The key findings showed that the strength of the relation between SES and outcomes varied across age groups throughout life, and across countries. For some outcomes, the relation was stronger the higher the age. The fourth chapter, by Rolfe and Yang Hansen (▶ Chap. 51, “Family Socioeconomic and Migration Background Mitigating Educational-Relevant Inequalities”), is a systematic review that examined the relations between SES and migration status and student outcomes, as well as how organizational factors may intensify or compensate for educational inequality. The key findings showed that SES is positively related to outcomes, but the strength varied according to the operationalization of SES, the level of analysis, and the methodology. Organizational factors of the educational system can intensify or compensate for educational inequality. Migration background was negatively related to student academic achievement, and the strength of the effect varied depending on the migration stage; it often disappeared when family SES was accounted for. There are several implications of the findings from the four chapters for policy, practice, and research. The chapters all point to rather large inequalities in many countries. While inequalities in terms of relations between SES and outcomes are rather stable across subject domains, they vary widely across countries. Gender inequalities are mostly stable across countries but vary according to subject (like the gender gaps in reading). The implications are that all countries need to look into the domains with gender gaps and try to identify what causes these gaps, while a number of countries need to find a way to reduce the importance of student home background for academic success. Policymakers and stakeholders should be made aware that league tables reflecting the level of inequality across countries need to be interpreted with care, as the results depend on how inequality is measured, the level of analysis, and the methodology. Moreover, controlling for SES when investigating the relation between migration and students’ outcomes is of utmost importance, as this often renders the relation insignificant. The implication for research on educational inequality using ILSA data is that such studies need to take extra care to handle the complex data correctly, and could profit from more multilevel modeling techniques. This issue is raised across all subsections of the Findings part of the handbook.
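For the multilevel modeling recommended here, the sketch below shows a minimal two-level random-intercept model with students nested in schools, using hypothetical column names; a full analysis would additionally incorporate sampling weights, plausible values, and possibly a third (country) level, all of which this toy example leaves out.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level file with a school identifier.
# Columns assumed: achievement, ses, immigrant, school_id.
df = pd.read_csv("students.csv")

# Two-level random-intercept model: student achievement predicted by SES and
# migration background, with a random intercept for schools to respect the
# clustered sampling design.
model = smf.mixedlm("achievement ~ ses + immigrant", data=df, groups=df["school_id"])
result = model.fit()
print(result.summary())

# The variance of the school random intercept relative to the residual variance
# gives a rough intraclass correlation, i.e., how much of the inequality lies
# between rather than within schools.
```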

Final Remarks

As ILSAs have developed into regular system assessments of international scope, overcoming many of the initial challenges, they are currently confronted with new hurdles. While advocates perceive ILSA as being committed to improving education, other authors criticize an underlying neoliberal agenda (Ehren, “International Assessments and School Accountability”). To capture the ways in which ILSAs have contributed to and changed education and education systems, a variety of institutions and individuals with diverse goals should be considered. Although the initiators of early ILSAs were driven predominantly by research interest, currently ILSA data are used by different stakeholders for various purposes, including rushed policy decisions. As pointed out by Blömeke et al. (▶ Chap. 22, “Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations”), standardization and comparability are given the highest priority, sometimes at the cost of oversimplification, while still requiring high investments and effort to implement. In consequence, whoever wants to benefit from ILSA must accept the comprehensive but somewhat flat – and predominantly descriptive – view of reality and use the generated knowledge to gain an overall understanding in the light of international comparisons. Such a perspective draws on the potential of ILSAs to the maximum extent possible, but it is less suitable for explaining mechanisms and causal effects. Many of the methodological limitations for which ILSAs are criticized are inherent in the design and nature of the studies. In the light of this criticism, the demand to handle ILSA data appropriately becomes ever more important (Blömeke et al., ▶ Chap. 22, “Conceptual and Methodological Accomplishments of ILSAs, Remaining Criticism and Limitations”). Despite the efforts undertaken to educate researchers on the usage of ILSA data, the popularization of ILSA has caused a rapid increase in publications that do not always apply the analytical methods required in the ILSA context. The spread of ILSA seems to follow a pattern well known in marketing, according to which popularization correlates with a loss of quality – particularly in the way the data are (mis)used, whether due to a lack of expertise or to ignorance. At the same time, the reviews done in the last part of the handbook, Findings, show that numerous excellent analyses of ILSA data are being conducted, providing valuable insights into education and school learning in a comparative perspective. In all the fields and areas reviewed, most findings from publications using ILSA data were aligned with the larger field (e.g., meta-studies, longitudinal studies, and other studies not using ILSA data). While ILSA may be useful for education policy purposes, it is challenging to establish a critical, educated dialogue with a broad audience that adheres to the required scientific rigor. The tension between the power- and user-oriented disputes that are typical of political environments on the one hand and the critical-constructive scientific culture on the other remains a challenge for the future dialogue. Concerning the interpretation and generalization of the results, it is necessary that policymakers and other interested stakeholders and practitioners acquire basic skills to understand the scientific language. Nevertheless, scientists are still challenged to understand policymakers’ needs and constraints. Both communities need to learn to maintain this discourse.



Index education and, 1435–1445, 1447, 1449–1453 educational qualifications, 207 in educational research, 1462 ESCS, 1463, 1469 family cultural capital, 1466 function of, 1464 home possessions, 206 indicators, 1462, 1468–1469 lawnmower, 206 Marxist theory of social stratification, 1464 mathematics achievement, 212, 213 measurement of, 1462 on migration status, 1477 multi/unidimensional construct, 207 multi-dimensional construct, 1464 multilevel logistic analysis, 1467 multiple-nation studies, 1466 parental education, 207 school composition effect on student outcome, 1470–1472 single indicator, 1464 single-nation studies, 1466 social categories, 218 three-part model, 1462 TIMSS and PIRLS, 1463 unidimensional composition, 1463 Sociological theories, 1210 Sociology, 259 Southeast Asian Ministers of Education Organization (SEAMEO), 133 Southeast Asia Primary Learning Metrics (SEA-PLM), 124 capacity building activities, 134 description, 133 dissemination to stakeholders, 135–137 policy making process, 134–135 student academic outcomes, 133 Southern African Development Community (SADC), 443 Southern African Consortium for Monitoring Educational Quality (SACMEQ), 79, 283, 423, 440, 850, 856 aims and objectives, 442–443 assessment framework by SACMEQ I for reading, 451 assessment framework in SACMEQ II for pupils mathematics test, 452 competency levels from Namibia in SACMEQ II and III, 455 co-ordination of, 443 data collection and analysis, 454–457 design of studies I-IV, 447–448

  evaluation of, 462–463
  framework, 444–446
  governance, 443
  hypothesised two-level model of pupil achievement for SACMEQ III, 446
  management, 444
  policy-relevant analysis of questionnaire data for Zanzibar in SACMEQ III, 458
  policy suggestion from SACMEQ I for Mauritius, 445
  population and sampling, 447–449
  questionnaire analysis, 458–459
  questionnaires, 454
  reading and mathematics achievement scores, Tanzania, 456
  report on HIV and AIDS test on country level to Kenya in SACMEQ III, 457
  SACMEQ IV test, 453
  self-governance, 461
  studies, 441
  tests, 450–454
Specialized Institute for the Professional Training of Teachers (SIPTT), 1044
Specialized statistical analysis methods, 295
Sputnik shock, 329
Stage dimension, 267
Stakeholders, 131–133, 135–137
Standard deviations (SD), 1355, 1357, 1359
Standardized regression coefficients, 243
The STAR experiment, 807
State-of-the-art theories, 154
Statistical Yearbooks of Education, 958
Stratification, 671, 677, 678
Structural equation modeling (SEM), 262, 817, 823, 824, 1070
Structural organization, 1059, 1074–1075
Struggling learners, 1218
Student achievement, 209, 258, 306–307, 1401, 1402
Student Approaches to Learning (SAL), 1304
Student characteristics, 19
Student learning outcomes, 260, 265, 1500
  affective-motivational facets, 1175, 1176
  professional development, 1176
  teacher education, 1175
Student level factors, 265
Student motivation, 256
Student motivation and self-beliefs
  cognitive or social-cognitive theories, 1301
  complex constructs/confusing terminology, 1305
  countries/cultures, 1313–1315
  group differences, 1315, 1316
  IEA/OECD, 1302
  ILSA, 1303, 1306–1309, 1311–1313, 1316–1319
  motivation, 1301
  motivational frameworks, ILSA, 1303, 1305
  motivational patterns, 1310, 1311
  reading contexts, 1309, 1310
  self-efficacy and self-concept, 1303
Student-oriented practices, 270
Student outcomes, 20, 259, 494
Student participation, in content-related talk, 478
Student performance, 256
Students’ achievement, 969
  academic track, 972
  data and methods, 970–971
  linear regression models, 974
  long academic track, 971
  proportion of students, 972
  research questions, 970
Students’ Approaches to Learning scales (SAL), 964
Student’s behavior, 493
Students’ responses, 969
Students’ tasks, 1211, 1224, 1228, 1233
Student–student interactions, 484
Student-Teacher Linkage Forms, 987
Study design, 538
Study frameworks, 589–590
Subject-matter specificity, 487–489
Sub-Saharan Africa, 420
Supportive climate, 478
Supportive teaching, 1101
Survey domains, 286
Survey of Mathematics and Science Opportunities (SMSO), 161
Surveys of the Enacted Curriculum (SEC), 224
Sustainable Development Goals (SDGs), 28, 64, 271, 387, 421, 1245
Systematic measurement errors, 846
Systematic review, 1096–1108
Systematic sampling, 291, 671
System-Level Descriptive Information, 271
System-level factors, 271
System-level policies, 123

T
Taker actions, 929
Target population, 289, 663, 666
Taylor series approximation method, 798
Teacher and Learning International Survey (TALIS), 328
Teacher beliefs, 334
Teacher competence, 326, 1058
  conceptualizations, 1169, 1180
  domain/situation-specific, 1178
  GPK, 1170
  ILSA (see International large-scale assessments (ILSA))
  knowledge/skills, 1168
  learning opportunities, 1179
  notion, 1178
  quality education, 1171, 1173
  reflective practitioner, 1179
  research, 1168, 1179
  responsibilities, 1168
  second phase, 1179
  situation-specific skills, 1170, 1171
  structure and development, 1176, 1177
  student learning outcomes (see Student learning outcomes)
  TEDS-M, 1180
  time lag, 1180
Teacher education, 328
Teacher Education and Development Follow-Up Study (TEDS-Instruct), 475
Teacher Education and Development Study in Mathematics (TEDS-M), 257, 327, 614, 1168, 1176
  beliefs, 333, 334
  countries, 358, 359
  curriculum analysis, 338
  developments, 369
  ILSA, 327
  institutional survey, 338
  international context, 328
  knowledge domains, 333
  levels, 329, 330
  macro level, 330, 331, 357
  meso level, 331, 359
  micro level, 332, 333
  national survey, 337
  OTL, 338
  publications, 344
  research questions, 334
  sampling, 335, 336
  target population, 335, 337
  teacher educator survey, 338
Teacher education program, 368
Teacher education systems, 981–982
Teacher evaluation, 497–499
Teacher knowledge, 328–330, 340
Teacher preparation units (TPU), 343
Teacher professional learning, 472
Teacher salaries, 358
Teacher’s behavior, 493
Teachers’ beliefs
  commitment, 1188, 1189, 1196, 1197
  conceptualisation, 1193, 1195, 1196
  data and methods, 1189–1191
  history, 1186
  ILSA studies, 1191, 1192
  leadership, 1189, 1198, 1199
  motivation, 1200
  opportunities, 1200, 1201
  professionalism, 1188, 1198
  satisfaction, 1187, 1188
  self-efficacy, 1187, 1199, 1200
  value, 1198
  work satisfaction, 1197
Teacher–student interaction, 478, 484
Teacher training, 1179
Teaching and learning, see Classroom teaching and learning assessment
Teaching and Learning International Survey (TALIS), 164–165, 271, 284, 380, 667, 1157
  appraisal system, 412
  culture, 411
  definition, 403
  design/methods, 405, 406, 408
  framework, 407
  indicators, 413
  instruments, 407, 413
  international survey, 409
  quality instruction, 410
  school leaders, 413
  standardised surveys, 404
  survey periodicity/country/economy participation, 404, 405
  teachers’ perceptions, 412
Teaching–learning process, 269
Teaching methods, 339
Teaching practices, 1211, 1219, 1225–1227, 1232
Teaching quality assessment, 475, 484
Teaching quality (TQ), in PISA
  classroom teaching, 1114
  construct under-representation, 1093
  cross-sectional study designs, 1092
  defining and implementing measures of teaching, 1108–1110
  domain-general variables, 1113
  domain-specific variables, 1113
  history, 1091–1092
  inappropriate measurement level, 1092
  integrating data, 1115–1118
  longitudinal approaches, 1093
  measurement invariance, 1122–1123
  methodological complexities, 1093
  PISA 2000 (reading), 1110
  PISA 2003 (mathematics), 1110–1111
  PISA 2006 (science), 1111
  PISA 2009 (reading), 1112
  PISA 2012 questionnaire framework, 1112
  reliability, 1122
  research on science teaching, PISA 2015, 1120–1122
  research on teaching mathematics, PISA 2012, 1118–1120
  validity, 1123–1124
Teaching quality (TQ), in TIMSS, 1097, 1107–1108
  coding and data extraction, 1098–1099
  construct under-representation, 1093
  cross-sectional study designs, 1092
  history, 1091–1092
  inappropriate measurement level, 1092
  instructional activities, 1100
  instructional engagement, 1100
  instructional practices and strategies, 1100
  longitudinal approaches, 1093
  measures of TQ, 1104–1105
  methodological complexities, 1093
  methods, 1104
  outcomes, 1103
  patterns of findings, 1105–1106
  questionnaires, 1101–1102
  samples, 1102–1103
  screening process, 1098–1099
  search procedure, 1097–1098
  teaching styles, 1100
Teaching skills approach, 1113
Techne, 551
Technology and Engineering Literacy (TEL), 1277
Tercer Estudio Regional Comparativo y Explicativo (TERCE), 284
Test characteristics, 934
Test-level analysis, 945
  and group-level characteristics, 946–947
  and personal characteristics, 945–946
Theatre education, 552–558
Theatre-in-education (TIE), 558
The Rationality of Feeling, 553
‘Think-pair-share’ format, 127
Third National Development Plan, 1028
Three basic dimensions (TBD), of teaching quality
  classroom management, 1095
  cognitive activation, 1095, 1096
  supportive climate, 1095
Time constraints, 122
TIMSS Videotaped Classroom Study, 257
Tracking
  Czech lower secondary education, 957–960
  impact, 957
Transition to e-assessment, 313–315
Trend analysis, 1505
Trend reporting
  issues in, 838–839
  PISA, 835–836
  TIMSS and PIRLS, 834–835
  trend results and policy reactions, 836–838
Trends in International Mathematics and Science Study (TIMSS), 35, 79, 87, 98, 123, 161–162, 178, 204, 256, 257, 284, 386, 438, 439, 450, 454, 473, 512, 668, 687, 703, 706, 711, 716, 722, 750, 752, 832–835, 837, 839, 840, 848, 850, 856, 859, 860, 1000, 1001, 1024, 1032–1035, 1041, 1044–1046, 1055, 1060–1062, 1064, 1124–1127, 1136, 1137, 1148, 1150–1152, 1156–1159, 1161, 1162, 1325, 1329, 1331, 1332, 1335, 1336, 1340, 1341, 1434, 1440, 1442, 1453, 1463, 1468, 1469, 1478
  assessment, 307, 313, 590
  attitudinal constructs, 522
  2007 bridging, 319
  2019 bridge, 320–321
  cognitive domains, 225
  Committees of International Experts, 309
  comparison, 237
  computer-based assessment, 523
  2019 content and cognitive domains, 307
  content validity, 231–232
  convergent validity, 238–245
  correlations, 241
  country level trend comparison, 535–537
  country variance, 514
  cycle by income level, 584
  divergent and convergent findings, 527–528
  domains of assessment, 518
  Encyclopedias, 308
  grade 8, 515
  IEA, 1244
  implementation, 595
  indicator in response, 1245
  inquiry as instructional approach, 1139–1141
  inquiry as instructional outcome, 1141–1143
  International Benchmarks, 1245
  international metrics, 582
  items, 310
  level playing field, 230
  linking method, 523–525
  longitudinal design, 531
  long-term trends, 306–307
  math teachers per school, 228
  mathematics achievement, 242
  meta-analyses, 248
  minimum proficiency, 1247
  and PIRLS, 583, 987
  and PISA reports, 229
  vs. PISA achievement on country level, 532–535
  2023 rotated design, 312
  NRCs’ role, 310
  number of items, 312
  numeracy, 754
  questionnaire scales, 522–523
  reliability, 247
  sampling schemes, 516–518
  secondary analyses studies, 1125
  social disparities, 984
  student outcomes, 1126
  study design, 237
  survey extensions, 983
  survey, 581, 585
  teacher qualifications, 981
  teacher questionnaire, 234
  teaching quality, 1096–1108
  test, 518
  TIMSS 2015 grade eight math achievement distributions, 753
  TIMSS 2015 grade eight math item location distribution, 753
  TIMSS 2015 grade eight math test information curve, 754
  TIMSS 2018 math country proficiency distributions, 763
Two-stage process, 335
Two-stage sample design, 291

U
U.S. National Assessment of Educational Progress (NAEP), 630
Unbiased samples, 665
UNESCO Institute for Statistics (UIS), 1245
United Nations Children’s Fund (UNICEF), 133
United Nations Educational, Scientific and Cultural Organization (UNESCO), 281
United Nations Sustainable Development Goals, 832
United Nations (UN), 281, 387, 587, 666, 832
Universal basic skills
  economic impacts, 46, 47, 49
  global challenge, 44, 46
Universalization of modern schooling, 1025
Unobserved heterogeneity, 812
US National Assessment of Educational Progress (NAEP), 525

V
Validity, 226–227, 702–704, 708, 709, 711, 715, 717, 1123–1124
Validity-Comparability Compromise, 475
Variable-centered approach, 1310
Variables, 258
Video classroom observation, 471
Video documentation, 499
Video studies, 495–496
Videotaped Classroom Study, 257
Vocational education, 1029

W
Weekly homework, 1212
Weighting, 671, 681
WEI-SPS, 429
Well-being
  adults, ILSAs, 1340, 1341
  child, 1326, 1327
  cognitive dimension, 1329, 1330
  ILSAs, 1327, 1328, 1337–1340
  material economic dimension, 1335, 1336
  physical dimension, 1336, 1337
  psychological dimension, 1330, 1331
  questionnaire, 294
  social dimension, 1332–1334
Western education, 551
Western societies, 20
World Bank Group (WBG), 177
World Bank’s Strategy 2020, 421
World Education Forum, 582
The Written Composition Study, 611