The Handbook of Personality Dynamics and Processes [First edition]
ISBN: 0128139951, 9780128139950

The Handbook of Personality Dynamics and Processes is a primer to the basic and most important concepts, theories, and methods […]

English · 1384 pages [1395] · 2021



THE HANDBOOK OF PERSONALITY DYNAMICS AND PROCESSES

Edited by

PROF. DR. JOHN F. RAUTHMANN Chair of Personality Psychology and Psychological Assessment Bielefeld University, Bielefeld, Germany

Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

© 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-813995-0

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Nikki Levy
Acquisitions Editor: Joslyn Chaiprasert-Paguio
Editorial Project Manager: Barbara Makinster
Production Project Manager: Kiruthika Govindraju
Cover Designer: Matthew Limbert
Typeset by SPi Global, India

Contributors

Jonathan M. Adler, Olin College of Engineering, Needham, MA, United States
Balca Alaybek, Department of Psychology, George Mason University, Fairfax, VA, United States
Jayne L. Allen, University of New Hampshire, Durham, NH, United States
Jens B. Asendorpf, Department of Psychology, Humboldt University of Berlin, Berlin, Germany
Mitja D. Back, Department of Psychology, University of Münster, Münster, Germany
Sanna Balsari-Palsule, Wharton People Analytics, University of Pennsylvania, Philadelphia, PA, United States; Carleton University, Ottawa, ON, Canada
Anna Baumert, Max Planck Institute for Research on Collective Goods, Bonn; School of Education, Technical University Munich, Munich, Germany
Nicola Baumann, University of Trier, Trier, Germany
Emorie D. Beck, Washington University in St. Louis, St. Louis, MO, United States
Verònica Benet-Martínez, Catalonian Institution for Advanced Research and Studies (ICREA) and Pompeu Fabra University, Barcelona, Spain
Laura E.R. Blackie, University of Nottingham, Nottingham, United Kingdom
Gabriela S. Blum, Department of Psychology, Technical University Dresden, Dresden, Germany
Marleen De Bolle, Hudson Belgium, Ghent, Belgium
Annette Brose, Humboldt-Universität zu Berlin, Berlin, Germany; KU Leuven, Leuven, Belgium; Max Planck Institute for Human Development, Berlin, Germany
Ashley D. Brown, University of Southern California, Los Angeles, CA, United States
G. Leonard Burns, Department of Psychology, Washington State University, Pullman, WA, United States
Nicole M. Cain, Department of Clinical Psychology, Graduate School of Applied and Professional Psychology, Rutgers University, Piscataway, NJ, United States
Erica Casini, Department of Psychology, University of Milan-Bicocca, Milan, Italy
Daniel Cervone, University of Illinois at Chicago, Chicago, IL, United States
D. Angus Clark, University of Michigan, Ann Arbor, MI, United States
Giulio Costantini, Department of Psychology, University of Milan-Bicocca, Milan, Italy
Reeshad S. Dalal, Department of Psychology, George Mason University, Fairfax, VA, United States
Rebekah L. Damitz, Department of Psychology, West Virginia University, Morgantown, WV, United States
M. Brent Donnellan, Michigan State University, East Lansing, MI, United States
Charles C. Driver, Max Planck Institute for Human Development, Humboldt University, Berlin, Germany
David M. Dunkley, Department of Psychology, McGill University, Montreal; Lady Davis Institute—Jewish General Hospital, Montreal, QC, Canada
Elizabeth A. Edershile, Department of Psychology, University of Pittsburgh, Pittsburgh, PA, United States
David M. Fisher, The University of Tulsa, Tulsa, OK, United States
William Fleeson, Wake Forest University, Winston-Salem, NC, United States
Marc A. Fournier, University of Toronto Scarborough, Toronto, ON, Canada
R. Michael Furr, Department of Psychology, Wake Forest University, Winston-Salem, NC, United States
Marco R. Furtner, University of Liechtenstein, Vaduz, Liechtenstein
Josef H. Gammel, University of Munich, Munich, Germany
Christian Geiser, Department of Psychology, Utah State University, Logan, UT, United States
Samuel D. Gosling, The University of Texas at Austin, Austin, TX, United States; The University of Melbourne, Melbourne, VIC, Australia
Birk Hagemeyer, Institute of Psychology, Friedrich Schiller University Jena, Jena, Germany
Sarah E. Hampson, Oregon Research Institute, Eugene, OR, United States
Gabriella M. Harari, Stanford University, Stanford, CA, United States
P.D. Harms, Department of Management, University of Alabama, Tuscaloosa, AL, United States
Patrick L. Hill, Department of Psychological and Brain Sciences, Washington University St. Louis, St. Louis, MO, United States
Fred Hintz, Department of Psychology, Utah State University, Logan, UT, United States
Joeri Hofmans, Department of Work and Organizational Psychology (WOPs), Faculty of Psychology and Educational Sciences, Vrije Universiteit Brussel, Brussel, Belgium
Kai T. Horstmann, Institute of Psychology, Humboldt-Universität zu Berlin, Berlin, Germany
Nathan W. Hudson, Department of Psychology, Southern Methodist University, Dallas, TX, United States
Hans IJzerman, LIP/PC2S, Université Grenoble Alpes, Grenoble, France
Joshua J. Jackson, Washington University in St. Louis, St. Louis, MO, United States
Eranda Jayawickreme, Wake Forest University, Winston-Salem, NC, United States
Christian Kandler, Department of Psychology, University of Bremen, Bremen, Germany
Julia Krasko, Department of Psychology, Ruhr University Bochum, Bochum, Germany
Julius Kuhl, University of Osnabrück, Osnabrück, Germany
Filip Lievens, Lee Kong Chian School of Business, Singapore Management University, Singapore, Singapore
Brian R. Little, Wharton People Analytics, University of Pennsylvania, Philadelphia, PA, United States; Carleton University, Ottawa, ON, Canada
Corinna E. Löckenhoff, Cornell University, Ithaca, NY, United States
Maike Luhmann, Department of Psychology, Ruhr University Bochum, Bochum, Germany
Aaron W. Lukaszewski, Department of Psychology, California State University, Fullerton, CA, United States
Dillon M. Luke, The University of Texas at Austin, Austin, TX, United States
E.J. Masicampo, Department of Psychology, Wake Forest University, Winston-Salem, NC, United States
John D. Mayer, University of New Hampshire, Durham, NH, United States
Robert R. McCrae, Gloucester, MA, United States
Jay L. Michaels, University of South Florida, Sarasota, FL, United States
Lynn C. Miller, University of Southern California, Los Angeles, CA, United States
Brian Monroe, Department of Psychology, University of Alabama, Tuscaloosa, AL, United States
Alain Morin, Department of Psychology, Mount Royal University, Calgary, AB, Canada
D.S. Moskowitz, Department of Psychology, McGill University, Montreal, QC, Canada
Daniel K. Mroczek, Department of Psychology, Northwestern University, Evanston, IL, United States
Sandrine R. Müller, Columbia University, New York, NY, United States
Marcus Mund, Department of Personality Psychology and Psychological Assessment, Institute of Psychology, Friedrich Schiller University Jena, Jena, Germany
Steffen Nestler, Universität Münster, Münster, Germany
Franz J. Neyer, Department of Personality Psychology and Psychological Assessment, Institute of Psychology, Friedrich Schiller University Jena, Jena, Germany
Andrzej Nowak, Florida Atlantic University, Boca Raton, FL, United States; Department of Psychology, University of Warsaw, Warsaw, Poland
Monisha Pasupathi, Psychology, University of Utah, Salt Lake City, UT, United States
Marco Perugini, Department of Psychology, University of Milan-Bicocca, Milan, Italy
Le Vy Phan, University of Luebeck, Lübeck, Germany
Mike Prentice, Wake Forest University, Winston-Salem, NC, United States
Emanuele Preti, Department of Psychology, University of Milan-Bicocca, Milan, Italy
Markus Quirin, Technical University of Munich, Munich; PFH Göttingen, Göttingen, Germany
Famira Racy, Department of Psychology, Mount Royal University, Calgary, AB, Canada
John F. Rauthmann, Bielefeld University, Bielefeld, Germany
Stephen J. Read, University of Southern California, Los Angeles, CA, United States
William Revelle, Northwestern University, Evanston, IL, United States
Juliette Richetin, Department of Psychology, University of Milan-Bicocca, Milan, Italy
Julia Richter, Department of Psychology, Bielefeld University, Bielefeld, Germany
Rainer Riemann, Department of Psychology, Bielefeld University, Bielefeld, Germany
Whitney R. Ringwald, Department of Psychology, University of Pittsburgh, Pittsburgh, PA, United States
Michael J. Roche, Department of Psychology, West Chester University, West Chester, PA, United States
Gentiana Sadikaj, Department of Psychology, McGill University, Montreal, QC, Canada
Manfred Schmitt, Department of Psychology, University of Koblenz-Landau, Landau, Germany
Oliver C. Schultheiss, Department of Psychology, Friedrich-Alexander University, Erlangen, Germany
Mateu Servera, Department of Psychology, University of Balearic Islands, Palma, Spain
Brinkley M. Sharpe, Department of Psychology, University of Pittsburgh, Pittsburgh, PA, United States
Nicole M. Silva Belanger, Department of Psychology, West Virginia University, Morgantown, WV, United States
Joanna Sosnowska, Department of Work and Organizational Psychology (WOPs), Faculty of Psychology and Educational Sciences, Vrije Universiteit Brussel, Brussel, Belgium
Seth M. Spain, Department of Management, Concordia University, Montreal, QC, Canada
Clemens Stachl, Stanford University, Stanford, CA, United States
Kateryna Sylaska, Carthage College, Kenosha, WI, United States
Antonio Terracciano, Florida State University College of Medicine, Tallahassee, FL, United States
Sophia Terwiel, Department of Psychology, Ruhr University Bochum, Bochum, Germany
Robert P. Tett, The University of Tulsa, Tulsa, OK, United States
Mattie Tops, Department of Clinical, Neuro- and Developmental Psychology, VU University Amsterdam, Amsterdam; Developmental and Educational Psychology Unit, Leiden University; ICLON Leiden University Graduate School of Teaching, Leiden, The Netherlands
Nicholas A. Turiano, Department of Psychology, West Virginia University; West Virginia Prevention Research Center, Morgantown, WV, United States
Robin R. Vallacher, Florida Atlantic University, Boca Raton, FL, United States
Manuel C. Voelkle, Max Planck Institute for Human Development, Humboldt University, Berlin, Germany
Sarah Volz, Department of Psychology, Wake Forest University, Winston-Salem, NC, United States
Peter Wang, University of Southern California, Los Angeles, CA, United States
Joshua Wilt, Case Western Reserve University, Cleveland, OH, United States
Dustin Wood, Alabama Transportation Institute, University of Alabama, Tuscaloosa, AL, United States
William C. Woods, Department of Psychology, University of Pittsburgh, Pittsburgh, PA, United States
Aidan G.C. Wright, Department of Psychology, University of Pittsburgh, Pittsburgh, PA, United States
Cornelia Wrzus, Psychological Institute, Ruprecht Karls University Heidelberg, Heidelberg, Germany
Alexandra Zapko-Willmes, Department of Psychology, University of Bremen, Bremen, Germany
David C. Zuroff, Department of Psychology, McGill University, Montreal, QC, Canada

Preface

Since its inception, personality psychology has concerned the dynamics, processes, and functioning of individuals. While early scholars such as Gordon Allport keenly emphasized the dynamic aspects of personality, interest later shifted to more structural approaches concerning the organization of traits. However, in the last 15 years, there have been multiple efforts to bridge more structure- and more process-focused accounts of personality—and thus there has been a renaissance of dynamic approaches to personality. Currently, personality psychologists are showing a reinvigorated interest in the dynamic interplay of thoughts, feelings, desires, and actions within persons who are always embedded in social and cultural contexts. Several different phenomena (e.g., within-person variability, developmental processes, social dynamics, gene-environment interplay, psychopathological functioning) are studied with a range of methods and statistical analyses that take their dynamic nature into account (e.g., experience sampling, network modeling, systems-theoretical models). In these lines of research, dynamics and processes are tracked in different domains (e.g., cognition, emotion, motivation, regulation, behavior, self), contexts (e.g., social relationships, work), and timescales (e.g., in daily life, across the lifespan). Not surprisingly, dynamic approaches are quite heterogeneous. Complicating matters further is the fact that terms such as "dynamics," "processes," "mechanisms," or "functioning" are used interchangeably, haphazardly, or differently throughout the literature(s). Different literatures are also concerned with different forms of dynamics and processes, and they often operate in isolation from each other. It is thus time to bring these literatures together in one handbook.

To do the burgeoning field of "dynamic" personality psychology justice, this handbook takes an inclusive approach to personality dynamics and processes and thus compiles a diverse array of topics and methods. Although the topics and methods are varied, they are tied together by the motivation to understand personality and individual differences from a more dynamic perspective. This dynamic perspective is not only intellectually stimulating (e.g., for building explanatory models and causal theories) but also practically relevant (e.g., for better understanding processes related to job performance, health, or interventions). It will thus be integral in building toward a more explanatory and practically useful science of personality. This handbook seeks to provide the first comprehensive compendium on personality dynamics and processes, with 51 chapters falling into core concepts and content domains of study (Section I: Chapters 1–17), conceptual perspectives and theories (Section II: Chapters 18–27), methods and statistics (Section III: Chapters 28–40), and applications (Section IV: Chapters 41–51). It was my goal that readers interested in a more dynamic understanding of personality would have all relevant theory, methods, and research assembled in one place. This handbook can thus serve as a gateway into dynamics- and process-focused approaches for novices, but it can also be used by experienced scholars and teachers to read up on recent developments and trends in the literature. As the editor of this handbook, I not only got to closely read all chapters and immerse myself deeply in new and fascinating topics, but also had the privilege to work with some of the finest personality-psychological scholars of our time and many rising stars who will become future leaders of the field. These experts have written easily accessible and engaging chapters that summarize the state of the art and future directions. I hope readers of this handbook will find the chapters as interesting, informative, and educational as I did.

I am deeply thankful to all authors who have contributed to this handbook and put their time, energy, and expertise into their well-crafted chapters. I am also grateful for their continued support and patience, as it took quite a long time to compile this handbook. However, I am confident that the end product was worth the wait, and I hope that this handbook will inspire a whole new generation of psychologists to adopt more dynamic perspectives when studying personality and individual differences—and thus showcase how broad the field of personality psychology is and what it has to offer.

John F. Rauthmann
Bielefeld, May 2020

C H A P T E R

1 The history of dynamic approaches to personality

William Revelle^a and Joshua Wilt^b

^a Northwestern University, Evanston, IL, United States; ^b Case Western Reserve University, Cleveland, OH, United States

O U T L I N E

Introduction 4
Early dynamic models 5
  The data box 6
  Time and change 7
  Variables showing dynamic processes 9
  Descriptive models 9
Control theory: The power of feedback 10
  Animal models 12
  Human models 12
Formal models of personality dynamics 14
  Dynamics of action 14
  Modeling goals 15
Modeling the dynamics of emotion and personality 16
  Dynamic processes ≠ stochastic variation 17
Some classic diary studies 18
Data collection methods 20
  Self-report 20
  Behavior and physiology 21
The ESM revolution 21
Conclusions 23
References 24

Abstract

The study of personality dynamics has a long history of being said to be important, but a much shorter history of actually being examined. We give an overview of the past 100 years of research on dynamic processes and suggest how recent methodological and analytic techniques can be applied to the important problem of studying individual differences in the coherent patterning over time of affect, behavior, and cognition.

Keywords

Personality, Behavior, Goals, Idiographic, Dynamics, Communications

The Handbook of Personality Dynamics and Processes, https://doi.org/10.1016/B978-0-12-813995-0.00001-7


Introduction

Just as a song is a coherent patterning over time of different notes and rhythms, so is personality a coherent patterning over time and space of feelings, thoughts, goals, and actions. A song is not the average note played, nor should a person be seen as an average of affects, cognitions, desires, and behaviors. For it is the dynamic patterning of these components that is the unique signature of a song as well as of a person. That it is the patterning, not the specific notes, is clear when the haunting tune of Gershwin's "Summertime" is played by a guitar trio, or a Beatles' tune is played by the London Symphony Orchestra. Unfortunately, although it is easy to define personality in terms of dynamic patterns, it is much more difficult to study these patternings over time. The study of personality has long been divided into two broad approaches variously known as nomothetic versus idiographic, between person versus within person, structure versus process, statistical versus narrative, sociological versus biographical, cross-sectional versus developmental, and static versus dynamic. We hope to provide some linkage between these two cultures of personality research in the hope of an eventual integration. Although this chapter will not be nearly as thorough a review of the state of the field as that provided by Allport and Vernon (1930), we hope it is as useful today as their review 90 years ago. They provided a history of the study of personality dynamics up to 1930. Here we try to bring this forward to 2020. Or at least to about 2010. For the past 10 years have seen such an explosion of studies of the dynamics of affect, behavior, and cognition that it would be impossible to cover them all. Although exciting to witness such growth, we think it is important for the readers of this volume to appreciate the foundations behind much of the current work. That people differ from each other in the patterns of their feelings, actions, thoughts, and desires is obvious, and it is equally obvious that

each individual person varies in his or her thoughts, feelings, and behavior over time. We have claimed before that the study of personality is the study of the coherent patterning over time and space of affect, behavior, cognition, and desire (the ABCDs of personality) both between and within individuals. Although our long-term goal is an integrated model of human actions, and although to study coherence implies within-person patterning, it is the between-person differences in these patterns that have received most of our attention. An integrated theory requires combining the nomothetic between-person and idiographic within-person approaches into a unified framework. Unfortunately, this is difficult, for the types of analysis that work between individuals do not necessarily work within individuals (Molenaar, 2004). To generalize from the group aggregate to the individual, or from the individual to the group, the process needs to be ergodic (Fisher, Medaglia, & Jeronimus, 2018; Molenaar, 2004) (but see Adolf & Fried, 2019). Conventional trait approaches suggest that when controlling for trait level, item responses will be uncorrelated. That is, what is left over after the between-person signal is removed is just noise. We disagree and believe that to have an adequate understanding of personality, we need to be able to model responses within the individual over time as well as the between-individual differences (Beck & Jackson, 2020; Nesselroade & Molenaar, 2016; Revelle & Wilt, 2016). Biographers and students of narrative identity disagree with the naive trait approach and suggest that the richness of a person's life story is suitable for scientific investigation (McAdams, 1993, 2008). Indeed, this volume is concerned with the study of within-person dynamics, and it is appropriate to try to frame such research in terms perhaps more familiar to those of us who study individual differences between rather than within individuals. We hope this attempt to integrate nomothetic and idiographic approaches is not naive, and we know it is certainly not new. Almost 80 years ago, Cattell (1943) reviewed six


categories of traits, including the dynamic unities discussed earlier by Allport and Vernon (1930) and Allport (1937) as well as Stern (1910). The experimental psychologist, Woodworth, named his classic textbook Dynamic Psychology (Woodworth, 1918) as he attempted to answer both the questions of how (structure) and why (dynamic processes) of human and animal behavior. He continued this emphasis on dynamics for the next 40 years (Woodworth, 1958). Much of what is currently included in dynamic models reflects either explicitly or implicitly theories of motivation: the how and why of behavior. The terminology of motivation is that of needs, wants, and desires. The study of motivation is the study of how these needs and desires are satisfied over time. That is to say, to study motivation is to study dynamics (Atkinson & Birch, 1970; Heckhausen, 1991). However, there is more to dynamics than just motivation. For the patterning of thoughts, feelings, and desires can be seen to reflect stable individual differences in rates of change of internal states in response to external cues. We think this emphasis on dynamics should continue.

Early dynamic models

Perhaps because of an envy for the formalism of physics, Kurt Lewin wrote that to study behavior was to study its dynamics, for behavior was a change of state over time (Lewin, 1951). People's states changed in response to the self-perceived situation, not the situation as defined by an observer (Lewin, Adams, & Zener, 1935). They responded to the entire field, not to any particular cue. To Lewin, "field theory is probably best characterized as a method: namely a method of analyzing causal relations and of building scientific constructs" (Lewin, 1951, p. 45). To understand the individual, one had to understand the field of forces impinging on the individual and the way those forces were perceived. An understanding of the goals of action was essential in understanding


the action (Zeigarnik, 1927/1967). Behavior was not a reaction to a particular stimulus but rather to the entire field of potential rewards and punishments. A very important summary of Lewin's work was the introduction to American psychologists by Brown (1929). To read this is to understand the excitement of dynamic thinking that Lewin was emphasizing in contradistinction to the behaviorist movement, which was becoming popular in the United States. In his review Brown emphasizes the tension between the Gestalt psychologists of Europe and the behaviorism that was coming to dominate research in the United States. Lewin's distinctions between identical motor movements needing to be understood in terms of their broader meaning (copying text versus writing a letter), or the significance of a post box when one has a letter to mail versus not, make clear the need to study the motivational dynamics of behavior rather than the behavior per se. As Atkinson and Birch (1970) put it, motives had inertia and persisted until satisfied. They could not be studied without considering their dynamics over time (Zeigarnik, 1927/1967). Just as Berlin waiters could remember what their customers ordered for dinner until they had paid for it and then not be able to recall it, so did children remember the games they had been playing but had not yet finished rather than games that had reached a conclusion (Zeigarnik, 1927/1967). Similar results have been reported for unsolved versus solved anagrams (Baddeley, 1963) and, depending upon the task, reflect competing motivations for success and failure avoidance (Atkinson, 1953). Inspired by Zeigarnik's initial study, examination of the effects of interrupted tasks continues to this day to address the modern problem of timesharing between many tasks (Couffe & Michael, 2017) and is a major concern for computer scientists and human factor engineers. To what extent is the writing of a manuscript hindered by frequent interruptions from email or text notification? To what extent is the


learning of material by students hindered by their attempts at balancing the many demands, both social and intellectual, as they attempt to time-share their responses to these demands? Indeed, a web page with the delightful name of https://interruptions.net is dedicated to the memory of Bluma Zeigarnik, with a voluminous reading list of the costs and benefits of interruptions and the dynamics of behavior.

The data box

In order to integrate the study of temporal changes with cross-sectional measurement, Cattell (1946) introduced P techniques in his three-dimensional organization of data (the data box) that considered Persons, Tests, and Occasions.^a Traditional personality descriptions (R analysis) were correlations of Tests over Persons, but some had proposed correlations of people over tests (e.g., Q analysis; Stephenson, 1935, 1936), which allowed identifying clusters of people who showed similar profiles across tests. Finding the correlation of items for single individuals across time (P technique) allowed Cattell and his colleagues (Cattell, 1950; Cattell, Cattell, & Rhymer, 1947; Cattell & Cross, 1952; Cattell & Luborsky, 1950) to analyze dynamic traits. We now would view the Data Box as a way of conceptualizing nested multilevel data, for normally, variations over time (P technique) are nested within individuals (R technique). Early use of P technique tended to be demonstrations for one or a few subjects; e.g., one subject suffering from a peptic ulcer was studied on 49 variables over 54 days (Cattell & Luborsky, 1950). The variables included psychophysiological measures such as blood glucose concentration and lymphocyte counts as well as objective

personality measures, self-reports, and peer ratings. Some of the personality variables had been chosen based upon prior R analyses with other subjects. Of the P factors identified within this subject, some matched R (between subject) factors, but some did not. A subsequent study (Cattell & Cross, 1952) followed one subject over 40 days with two observations per day. They chose marker variables from R analysis and then searched for matching factors in the P analysis. They refer to these motivational factors as ergs^b and they plot the rise and fall over time of 10 such ergs as mating, parental protection, self-sentimental, etc. Both of these studies used normal correlations of the measures, and although graphically showing the changes of "ergs" over time, the actual factor analyses did not take the temporal patterning into account. That is, the correlations and their loadings on the factor structure would have been the same if the temporal sequence had been randomized (see the brief simulation sketch below). Unfortunately, this problem still plagues many P analyses and has started to be addressed with lagged correlations (Beck & Jackson, 2020) and dynamic factor analysis (Molenaar, 1985; Molenaar & Nesselroade, 2009). Many of these early models emphasized how people differed in their perceptions of the situation and that to understand the individual dynamics, we needed to understand these perceptions (Kelly, 1955). Kelly's theory lives on with the use of his Role Construct Repertory Grid Test, which emphasizes the assessment of an individual's important constructs rather than relying on some predetermined set. A perceptual model incorporating the dynamic effects of feedback was proposed by Combs and Snygg (1952).

^a Cattell (1966) subsequently enlarged the data box to include 10 dimensions, but it is the three-dimensional organization that is most helpful. He also varied the names for the six slices (P, Q, R, T, etc.) from publication to publication.

^b To read Cattell is to discover a completely idiosyncratic vocabulary, which, although useful in not carrying excessive meaning, has not been widely adopted.


For the negatively motivated individual (concerned with avoiding failure or the pain of failure), worries about failure lead to poor performance, which feeds back to produce even more worry. The more approach-oriented individual, however, perceives effort as an opportunity for success and tries harder, which tends to lead to success.

Time and change

To study the dynamics of personality is to study changes in Affects, Behavior, Cognition, and Desires (ABCD) over time. These temporal changes need to be analyzed in terms of the psychological spectrum (Revelle, 1989), which ranges from the milliseconds of reaction time to the seconds of emotion, the minutes of mood, the diurnal variation (8.64 ∗ 10^4 s) of arousal, testosterone, and body temperature, monthly menstrual rhythms (2.5 ∗ 10^6 s), seasonal variations in weather-related affect and behavior, year-to-year changes of educational experience, and the lifespan changes in development over 95 years (or 3 ∗ 10^9 s, see Fig. 1). Seemingly distinct domains of study differ in the duration examined, but all can benefit from thinking dynamically. Physiological studies of EEG or MRI examine neural changes over milliseconds to seconds; studies of basic signal detection focus on the accuracy and reaction time to make simple or complex choices. Those who study the emotional effect of success and failure feedback examine changes in emotion and performance over minutes to hours. The diurnal rhythmicity of arousal interacts with stable trait measures of impulsivity to affect cognitive performance (Revelle, Humphreys, Simon, & Gilliland, 1980). Testosterone levels systematically decline during the day and affect the emotional reactions to angry faces (Wirth & Schultheiss, 2007). "Owls" and "larks" differ in the phase of their diurnal body temperature rhythm (Baehr, Revelle, & Eastman, 2000). Decrements in sustained performance known as failures of


vigilance affect drivers, sonar operators, and TSA security inspectors (Broadbent, 1971; Mackie, 1977) and have been associated with trait differences in extraversion (Koelega, 1992). Life span developmental psychologists focus on the dynamic stages of lives as well as the cumulative record of accomplishment (Damian, Spengler, Sutu, & Roberts, 2019; Lubinski, 2016; Lubinski & Benbow, 2006; Oden, 1968; Spengler, Damian, & Roberts, 2018; Terman & Baldwin, 1926). To understand how the systematic changes in life demands from childhood through adolescence, young adulthood, parenthood, and aging shape behavior is to study the dynamic coherence of being. Just as the day-to-day weather fluctuates drastically, and seasonal changes in climate lead to large changes in mean levels, so do we need to analyze individual differences at these different temporal frequencies (Baumert et al., 2017; Revelle & Condon, 2017). One such approach was discussed by Larsen (1987) who introduced spectral analysis to the study of personality and emotion and provided a very helpful review of the prior literature. In his examination of mood variation over multiple days, Larsen (1987) specifically rejected using time series design and rather focused on the spectrum of frequencies that represent mood variation. In a compelling footnote, he distinguishes between deterministic periodicity (equivalent to the timing of a pendulum) versus randomly perturbed stochastic processes such as a pendulum being shot at by a mischievous child with a peashooter. A relatively unknown but important early study showing the rhythmicity and variability of mood was done by Johnson (1937) who examined the mood of 30 female students over 65–90 days at the University of California, Berkeley. The rating scale was a single item with ratings of euphoria versus depression ranging from "I almost never feel more elated," to "as depressed as I practically ever feel." Although euphoria and depression are probably not

[Fig. 1 The psychological spectrum: time scales of affect, behavior, and cognition, from milliseconds (10^-3 s) through seconds, minutes, and days to the years of the life span.]

28 Variability and consistency in the analysis of behavior

… (i.e., $\bar{C}_P > 0$). That is, it suggests that, in the population, the pattern of differences between people is cross-situationally consistent to some degree.

The situation effect is similar. It is determined by the degree of consistent differences between situations:

$$F_{situation} = \frac{\hat{\sigma}^2_{residual} + n_P \bar{C}_S}{\hat{\sigma}^2_{residual}} \tag{28}$$

For the data in Table 1 (and Table 3B) and consistent with Output 1:

$$F_{situation} = \frac{6.27 + 6(11.54)}{6.27} = 12.05$$

Thus, a significant $F_{situation}$ value also indicates a nonzero degree of consistency (i.e., $\bar{C}_S > 0$). It suggests that, in the population, the pattern of differences between situations is consistent across people to some degree.
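To make the link between consistency and the F-ratio concrete, here is a brief sketch (a hypothetical check, using the Table 5 cell means as stand-ins for the Table 1 values, as the note to Table 5 indicates). It computes $\bar{C}_S$ directly as the average covariance between persons' situation profiles and recovers $F_{situation} = 12.05$:

```python
import numpy as np
from itertools import combinations

# Cell means from Table 5 (the note there says these equal the Table 1 values):
# rows = 6 persons, columns = 3 situations.
m = np.array([
    [8.2,  7.2, 16.5],   # Allison
    [9.1,  3.0,  9.9],   # Brad
    [12.9, 11.5, 11.5],  # Claire
    [13.9, 10.1, 18.1],  # Dale
    [14.8,  4.4, 14.8],  # Ella
    [10.1,  5.8, 13.2],  # Fred
])
nP, nS = m.shape

# C_S: average covariance, across all person pairs, of the situation profiles
cS = np.mean([np.cov(m[i], m[j])[0, 1] for i, j in combinations(range(nP), 2)])

# Residual mean square for the two-way (persons x situations) layout
grand = m.mean()
resid = m - m.mean(1, keepdims=True) - m.mean(0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((nP - 1) * (nS - 1))

f_situation = (ms_res + nP * cS) / ms_res
print(round(cS, 2), round(ms_res, 2), round(f_situation, 2))  # 11.54 6.27 12.05
```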

The slightly more complex case: Persons, situations, and person-situation interaction

We have focused so far on the simplest case of a person-situation data box, in which each person is observed only once in each situation. Importantly, all ideas presented thus far generalize to more complex cases. This is because, in various forms, both variability and consistency are at the heart of ANOVA. Perhaps the most natural extension is a design in which each person is observed more than one time in each situation, as illustrated in Table 5. In this design, the interaction between persons and situations can be separated from the residual term. For now, let us imagine that the two observations in each cell are not systematically separable (e.g., there were many different observers who were unsystematically assigned to rate various people in various situations on behavioral assertiveness; or behavioral assertiveness was measured with several different variables that varied unsystematically across persons and situations; or behavioral assertiveness was measured several times within each situation).

TABLE 5  A person-by-situation data box in which each person is observed twice within each situation.

Person   | No conflict | Peer conflict | Authority conflict | Mean  | Var
Allison  | 8.1, 8.3    | 7.3, 7.1      | 15, 18             | 10.63 | 26.06
Brad     | 9.3, 8.9    | 2.5, 3.5      | 10, 9.8            | 7.33  | 14.24
Claire   | 12.6, 13.2  | 11.6, 11.4    | 11.4, 11.6         | 11.97 | 0.65
Dale     | 14, 13.8    | 9.7, 10.5     | 18.2, 18           | 14.03 | 16.01
Ella     | 14.5, 15.1  | 4.5, 4.3      | 14.1, 15.5         | 11.33 | 36.05
Fred     | 10.2, 10    | 4.3, 7.3      | 13.3, 13.1         | 9.70  | 13.81
Mean     | 11.50       | 7.00          | 14.00              |       |
Var      | 7.44        | 10.82         | 9.48               |       |

Note: Variances are computed on cell means (i.e., values in Table 1).

ANOVA and variance components for these data are:

Name              DF   SS       MS       VC
Person             5   152.09    30.42    2.98
Situation          2   302.00   151.00   11.54
Person:Situation  10   125.35    12.53    5.95
Error             18    11.44     0.64    0.64
(Out. 2)
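These variance components can be reproduced from scratch. The sketch below (assuming the hypothetical Table 5 data) computes the sums of squares for the fully crossed persons × situations design and converts mean squares to variance components via the standard random-effects expected-mean-squares relations; this is one way to obtain Output 2, not necessarily the chapter's own software:

```python
import numpy as np

# Table 5: 6 persons x 3 situations x 2 observations per cell (hypothetical data)
y = np.array([
    [[8.1, 8.3], [7.3, 7.1], [15.0, 18.0]],     # Allison
    [[9.3, 8.9], [2.5, 3.5], [10.0, 9.8]],      # Brad
    [[12.6, 13.2], [11.6, 11.4], [11.4, 11.6]], # Claire
    [[14.0, 13.8], [9.7, 10.5], [18.2, 18.0]],  # Dale
    [[14.5, 15.1], [4.5, 4.3], [14.1, 15.5]],   # Ella
    [[10.2, 10.0], [4.3, 7.3], [13.3, 13.1]],   # Fred
])
nP, nS, nO = y.shape
grand = y.mean()

# Sums of squares for the fully crossed persons x situations design
ss_p = nS * nO * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_s = nP * nO * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()
cell = y.mean(axis=2)  # person x situation cell means
ss_ps = nO * ((cell - y.mean(axis=(1, 2))[:, None]
               - y.mean(axis=(0, 2))[None, :] + grand) ** 2).sum()
ss_e = ((y - cell[:, :, None]) ** 2).sum()  # within-cell (residual)

# Mean squares
ms_p, ms_s = ss_p / (nP - 1), ss_s / (nS - 1)
ms_ps = ss_ps / ((nP - 1) * (nS - 1))
ms_e = ss_e / (nP * nS * (nO - 1))

# Variance components (random-effects expected mean squares)
vc_e = ms_e
vc_ps = (ms_ps - ms_e) / nO
vc_p = (ms_p - ms_ps) / (nS * nO)
vc_s = (ms_s - ms_ps) / (nP * nO)
print(round(vc_p, 2), round(vc_s, 2), round(vc_ps, 2), round(vc_e, 2))
# -> 2.98 11.54 5.95 0.64  (matches Output 2)
```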

Person and situation effects For the person effect and situation effect, these values are closely related—or identical— to those in the simplest case, as discussed earlier. For our example, the mean squares values in Output 2 are simply twice the earlier mean squares values (due to 2 observations per cell in the current design), and the variance components are identical to the earlier values in Output 1. Thus, the meanings and implications of the person effect and situation effect are identical in this case and the simpler case. Table 6 provides the relevant equations linking the person and situation effects to variability and consistency terms, as defined in Table 2A. TABLE 6

Residual/error as within-cell differences In this more complex design, it is possible to separate the interaction from the residual/error (this was not possible in the simple case, as discussed before). As shown in Table 6, the residual is the degree to which the values within each cell differ from each other. For example, for Cell 1, Averaging VC ¼ (8.1–8.2)2 + (8.3–8.2)2 ¼ 0.02. such residual values across all 18 cells, V C ¼ :64. Thus, consistent with Output 2: MSresidual ¼ σ^2residual ¼ V C

(29)

MSresidual ¼ σ^2residual ¼ :64 For the data in Table 5, this reflects the average degree to which there are unexplained discrepancies in the observed behavioral assertiveness values for a given person in a given situation. Note that, as illustrated in the next example (see Table 7), it might be possible to account for some of these discrepancies. That is, depending on the design, there might be systematic differences between the observations within cells. If so, then a different statistical model could account for those systematic

ANOVA values in terms of variability and consistency—for case 2 (see Table 5). ICC

Effect Person

MS   nO V P + CP ðnS  1Þ

Situation

  nO V S + CS ðnP  1Þ

CS

PS interaction

  nO V P  CP

V P  CP  VC

or   nO VS  CS

VC

Rel

Abs

F

CP

CP VP

CP V P + CS

σ^2R + nS CP σ^2R

CS VS

CS V S + CP

σ^2R + nP CS σ^2R   nO V P  CP

. nO

or . V S  CS  V C

V P  CP  V C

VC

nO

V P  CP  V C

V P CP

or nO

VS  CS  VC VS CS

Residual

.

. nO

V P + CS

or

. nO

V S  CS  V C VS + CP

. nO

σ^2R

or   nO VS  CS σ^2R

VC

Note: MS, mean squares; VC, variance components; ICC, intraclass correlation; Rel, relative; Abs, absolute; F, F-value; σ^2R , variance component for residual term (^ σ 2residual ). Abs icc is based on averaging across observers (no/2). Eta squared is omitted due to space limitations.
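As an illustrative check on the Person row of Table 6 (again treating the Table 5 cell means as the Table 1 data), $\bar{C}_P$ can be computed directly as the average covariance between situation columns; the relative ICC then follows:

```python
import numpy as np
from itertools import combinations

# Cell means from Table 5 (persons x situations)
m = np.array([[8.2, 7.2, 16.5], [9.1, 3.0, 9.9], [12.9, 11.5, 11.5],
              [13.9, 10.1, 18.1], [14.8, 4.4, 14.8], [10.1, 5.8, 13.2]])

# V_P: average variance of persons within a situation (7.44, 10.82, 9.48 in Table 5)
vP = m.var(axis=0, ddof=1).mean()

# C_P: average covariance between situation columns, taken across persons
cP = np.mean([np.cov(m[:, i], m[:, j])[0, 1]
              for i, j in combinations(range(m.shape[1]), 2)])

print(round(vP, 2), round(cP, 2))  # 9.25 2.98
print(round(cP / vP, 2))           # relative ICC for persons: 0.32
```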


Person × situation interaction as less-than-perfect consistency

The person × situation interaction is related to inconsistency, as was the residual term from the simpler case. To the degree that the patterns of differences are inconsistent, the interaction will be large. For example, as shown in Table 6, the interaction reflects the degree to which the pattern of differences between persons is inconsistent across situations:

$$MS_{PS} = n_O(\bar{V}_P - \bar{C}_P), \tag{30}$$

where $n_O$ is the number of observations per cell. For the data in Table 5 (see Output 2):

$$MS_{PS} = 2(9.25 - 2.98) = 12.53$$

In terms of variance components, $\hat{\sigma}^2_{PS}$ reflects "reliable" inconsistency. That is, it reflects the amount of inconsistent variability that is not simply due to error—differences between people that are statistically reliable but not consistent across situations (see Table 6):

$$\hat{\sigma}^2_{PS} = \bar{V}_P - \bar{C}_P - \bar{V}_C/n_O \tag{31}$$

In terms of the data in Table 5 and consistent with Output 2:

$$\hat{\sigma}^2_{PS} = 9.25 - 2.98 - .64/2 = 5.95$$

This is a moderately large value as compared to the person and situation variance components. This suggests that there are indeed some reliable differences between people (within situations) that are not consistent across situations.

Such differences likely reflect psychological states or processes that systematically affect behavior within a given situation but that change across situations. In the current data (Table 5), mood might play such a role. For example, mood might affect the degree to which one behaves assertively, thus contributing to the pattern of differences between people in any given situation (e.g., persons experiencing high positive affect might be relatively assertive, while those experiencing low positive affect might be relatively unassertive). However, one's mood may change from one situation to another, thereby changing the pattern of differences between people. The person × situation interaction reflects the effect of such states or processes.

With reference again to the special case of equal person variances, the interaction becomes particularly clear as reflecting the degree to which differences are less than perfectly consistent (where, as described earlier, $\bar{r}_P$ is the average cross-situational consistency correlation):

$$\hat{\sigma}^2_{PS} = \bar{V}_P(1 - \bar{r}_P) - \bar{V}_C/n_O \tag{32}$$

As summarized in Table 6, the interaction can also be framed in terms of (in)consistency of the pattern of differences between situations:

$$MS_{PS} = n_O(\bar{V}_S - \bar{C}_S) \tag{33}$$

Again, in terms of variance components:

$$\hat{\sigma}^2_{PS} = \bar{V}_S - \bar{C}_S - \bar{V}_C/n_O \tag{34}$$

For the data in Table 5 and consistent with Output 2:

$$MS_{PS} = 2(17.81 - 11.54) = 12.53$$

$$\hat{\sigma}^2_{PS} = 17.81 - 11.54 - .64/2 = 5.95$$

From this perspective, the interaction indicates that there are some reliable differences between the situations that are not consistent across
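A brief sketch confirms that the person framing (Eq. 31) and the situation framing (Eq. 34) yield the same interaction component (hypothetical data as before; vC and nO are taken from Output 2):

```python
import numpy as np
from itertools import combinations

m = np.array([[8.2, 7.2, 16.5], [9.1, 3.0, 9.9], [12.9, 11.5, 11.5],
              [13.9, 10.1, 18.1], [14.8, 4.4, 14.8], [10.1, 5.8, 13.2]])
vC, nO = 0.64, 2  # average within-cell variance and observations per cell (Output 2)

# Person framing (Eq. 31): V_P - C_P - V_C / n_O
vP = m.var(axis=0, ddof=1).mean()
cP = np.mean([np.cov(m[:, i], m[:, j])[0, 1] for i, j in combinations(range(3), 2)])

# Situation framing (Eq. 34): V_S - C_S - V_C / n_O
vS = m.var(axis=1, ddof=1).mean()
cS = np.mean([np.cov(m[i], m[j])[0, 1] for i, j in combinations(range(6), 2)])

print(round(vP - cP - vC / nO, 2))  # 5.95
print(round(vS - cS - vC / nO, 2))  # 5.95 -- the same interaction component
```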


people. Such differences again likely reflect psychological states or processes that systematically affect behavior within a given situation but that change across situations. Again, as a given person's mood changes across situations, his or her behavior may systematically change in corresponding ways (e.g., behaving more assertively when experiencing high positive affect in a situation, and behaving less assertively when experiencing low positive affect in a situation). If such changes in mood are idiosyncratic between people (e.g., one person's changes in mood are unlike another person's pattern of changes in mood), then they create inconsistency (between persons) in the patterning of behavior across situations. Though framed in a somewhat different way, this is the same psychological process discussed in the initial interpretation of the person × situation interaction, earlier.

In sum, the person effect is the consistent variability in the pattern of differences among people. Similarly, the situation effect is the consistent variability in the pattern of differences among situations. The interaction term, in contrast, reflects variability that is inconsistent but reliable (i.e., not simply error variability). Lastly, the error term reflects variability that is both inconsistent and unreliable.

Inferential statistics (again) are related to consistency

Like the simpler design, this design's inferential statistics are also directly related to consistency (see Table 6). For the effects of persons and of situations, the F-values are^c:

$$F_{person} = \frac{\hat{\sigma}^2_{residual}/n_O + \hat{\sigma}^2_{PS} + n_S \bar{C}_P}{\hat{\sigma}^2_{residual}/n_O + \hat{\sigma}^2_{PS}} \tag{35}$$

$$F_{situation} = \frac{\hat{\sigma}^2_{residual}/n_O + \hat{\sigma}^2_{PS} + n_P \bar{C}_S}{\hat{\sigma}^2_{residual}/n_O + \hat{\sigma}^2_{PS}} \tag{36}$$

A significant F-value (i.e., one significantly larger than 1.0) indicates a nonzero degree of consistency, in terms of the particular effect being tested. For example, if the F-test for the person effect is greater than 1.0, then the average cross-situational consistency covariance ($\bar{C}_P$) is greater than zero. That is, if the value is greater than 1.0, then—to some degree—the pattern of differences between people is consistent across situations. If the F-value is statistically significant, then there is likely to be a nonzero amount of consistent between-person variance in the population. The F-test for the situation effect is similar, indicating whether the pattern of differences between the situations is consistent across people. For the data in Table 5 (see Output 2):

$$F_{person} = \frac{.64/2 + 5.95 + 3(2.98)}{.64/2 + 5.95} = \frac{6.27 + 8.94}{6.27} = 2.43$$

$$F_{situation} = \frac{.64/2 + 5.95 + 6(11.54)}{.64/2 + 5.95} = \frac{6.27 + 69.24}{6.27} = 12.04$$
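After multiplying numerator and denominator by $n_O$, Eqs. (35) and (36) are simply the familiar random-effects F-ratios of each effect's mean square to the interaction mean square. A two-line check, assuming the Output 2 values:

```python
# Mean squares from Output 2 (person, situation, person:situation)
ms_p, ms_s, ms_ps = 30.42, 151.00, 12.53

# Eqs. (35)-(36) reduce to the familiar random-effects F-ratios
print(round(ms_p / ms_ps, 2))  # F_person    = 2.43
print(round(ms_s / ms_ps, 2))  # F_situation = 12.05 (12.04 in text; rounding)
```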

A more than "slightly more" complex case: Persons, situations, and observations

The previous section addressed a case in which each person is observed multiple times within each situation. In that section, the differences between those multiple observations were treated as unsystematic; however, it might be possible to systematically differentiate the observations within each situation.

^c These F-tests are based on a model in which the person factor is treated as random, and they would apply regardless of whether the situation effect is treated as random or fixed.


For example, when measuring assertive behavior, researchers might ask two observers to rate all participants in each situation. In such a case, the two observations in each cell are systematically different from each other, and they can be systematically connected to observations in the other cells. For the data in Table 5, imagine that the "left" score in each cell was provided by Observer A, and the "right" score by Observer B. Analytically, "Observer" can be added as a factor in the design, with all three factors—Persons, Situations, and Observers—fully crossed. A model with three main effects and three two-way interactions produces, with the data in Table 5:

Name                 DF   SS       MS       VC
Person                5   152.09    30.42    3.07
Situation             2   302.00   151.00   11.57
Observer              1     2.15     2.15    0.13
Person:Situation     10   125.35    12.53    5.89
Person:Observer       5     1.13     0.23   -0.17
Situation:Observer    2     0.68     0.34   -0.07
Error                10     7.48     0.75    0.75
(Out. 3)

This more complex design merits discussion because it demonstrates important extensions of, and differences from, the simpler cases. As we shall see, the more complex design extends and generalizes the simpler designs, in that its effects are similarly based upon variability and consistency. However, it also differs from the previous designs in important ways. A discussion of several key effects (person, situation, person × situation interaction) should provide some insight into this. For these effects (and the residual), Table 7 presents the links between ANOVA terms and terms representing variability and consistency (as summarized in Table 2B).

TABLE 7  ANOVA values in terms of variability and consistency—for case 3 (see Table 5).

Person:
  MS: $\bar{V}_{PwSO} + \bar{C}_{PbS}(n_S - 1) + \bar{C}_{PbO}(n_O - 1) + (n_S - 1)(n_O - 1)\bar{C}_{PbSO}$
  VC: $\bar{C}_{PbSO}$
  ICC (Rel): $\bar{C}_{PbSO} / \bar{V}_{PwSO}$
  F: $\dfrac{\hat{\sigma}^2_R + \bar{C}_{PbS}\,n_S + \bar{C}_{PbO}\,n_O + \bar{C}_{PbSO}(n_S n_O - n_S - n_O)}{\hat{\sigma}^2_R + \bar{C}_{PbS}\,n_S + \bar{C}_{PbO}\,n_O - \bar{C}_{PbSO}(n_S + n_O)}$

Situation:
  MS: $\bar{V}_{SwPO} + \bar{C}_{SbP}(n_P - 1) + \bar{C}_{SbO}(n_O - 1) + (n_P - 1)(n_O - 1)\bar{C}_{SbPO}$
  VC: $\bar{C}_{SbPO}$
  ICC (Rel): $\bar{C}_{SbPO} / \bar{V}_{SwPO}$
  F: $\dfrac{\hat{\sigma}^2_R + \bar{C}_{SbP}\,n_P + \bar{C}_{SbO}\,n_O + \bar{C}_{SbPO}(n_P n_O - n_P - n_O)}{\hat{\sigma}^2_R + \bar{C}_{SbP}\,n_P + \bar{C}_{SbO}\,n_O - \bar{C}_{SbPO}(n_P + n_O)}$

PS interaction:
  MS: $\bar{V}_{PwSO} - \bar{C}_{PbS} + (n_O - 1)\bar{C}_{PbO} - (n_O - 1)\bar{C}_{PbSO}$ or $\bar{V}_{SwPO} - \bar{C}_{SbP} + (n_O - 1)\bar{C}_{SbO} - (n_O - 1)\bar{C}_{SbPO}$
  VC: $\bar{C}_{PbO} - \bar{C}_{PbSO}$ or $\bar{C}_{SbO} - \bar{C}_{SbPO}$
  ICC (Rel): $(\bar{C}_{PbO} - \bar{C}_{PbSO})/(\bar{V}_{PwSO} - \bar{C}_{PbS})$ or $(\bar{C}_{SbO} - \bar{C}_{SbPO})/(\bar{V}_{SwPO} - \bar{C}_{SbP})$
  F: $[\hat{\sigma}^2_R + n_O(\bar{C}_{PbO} - \bar{C}_{PbSO})]/\hat{\sigma}^2_R$ or $[\hat{\sigma}^2_R + n_O(\bar{C}_{SbO} - \bar{C}_{SbPO})]/\hat{\sigma}^2_R$

Residual:
  MS: $\bar{V}_{PwSO} - \bar{C}_{PbS} - \bar{C}_{PbO} + \bar{C}_{PbSO}$ or $\bar{V}_{SwPO} - \bar{C}_{SbP} - \bar{C}_{SbO} + \bar{C}_{SbPO}$
  VC: same as $MS_{residual}$

Note: MS, mean squares; VC, variance components; ICC, intraclass correlation; Rel, relative; F, F-value; $\hat{\sigma}^2_R$, variance component for residual term ($\hat{\sigma}^2_{residual}$). The full model would include additional terms (e.g., Observer main effect, Person × Observer interaction, etc.). Eta squared and Absolute ICC are omitted due to space limitations.


Person effect and situation effect still reflect variability and consistency

In the more complex design, the person effect still hinges on two factors. One is variability—differences between people; the other is consistency—the degree to which the pattern of differences between people is consistent across situations and/or observers.

In terms of the mean squares, two key factors affect the person effect (see Table 2B). One is the degree to which persons differ within each combination of situation and observer. In Table 5, consider Observer A's ratings of persons in the No Conflict situation. The variance of those ratings is $V_P = 6.91$, reflecting the degree to which Observer A perceived differences between participants in that situation. This is the variance of person differences within a combination of situation and observer ($V_{PwSO}$). Computing similar variances for all six combinations (Observer B in the No Conflict situation, Observer A in the Peer Conflict situation, etc.), and averaging those variances produces $\bar{V}_{PwSO} = 9.53$—the average person variance, within situations and observers. This represents the degree of differences between persons that a given observer is likely to see in a given situation.

The second factor affecting the person effect is the degree to which the pattern of differences between persons is consistent between situations and/or observers. There are three facets to this consistency, as summarized in Table 2B. The first facet reflects the degree to which a given observer views the pattern of differences among persons as being consistent between situations ($\bar{C}_{PbS}$). For example, Observer A rates Allison as less assertive than Dale in both the No Conflict situation and in the Peer Conflict situation. Thus, Observer A sees the pattern of differences between people as at least somewhat consistent between those two situations. The average of all six possible between-situation/within-observer covariances is $\bar{C}_{PbS} = 2.89$. The second facet of consistency is the degree to which, within a given situation, the pattern of differences among persons is consistent between observers ($\bar{C}_{PbO}$). For example, within the No Conflict situation, both Observers A and B rate Allison as less assertive than Dale. Thus, Observer A sees the pattern of differences between persons in the No Conflict situation in a way that is at least somewhat consistent with Observer B's ratings in the same situation. In the current example, this is akin to inter-rater measurement reliability. The average of all six possible between-observer/within-situation covariances is $\bar{C}_{PbO} = 8.96$. The third facet of consistency reflects the degree to which the pattern of differences among persons is consistent between situations and between observers ($\bar{C}_{PbSO}$). For example, the covariance between Observer A's ratings in the No Conflict situation and Observer B's ratings in the Peer Conflict situation is 2.51. There are six such covariances from Table 5, each reflecting the degree to which the pattern of differences among people is consistent as rated by different observers viewing the persons in different situations. The average of these is $\bar{C}_{PbSO} = 3.07$.

Based on these consistency covariances and on $\bar{V}_{PwSO}$ (and upon $n_S$ and $n_O$, the number of situations and observers), Table 7 shows that the mean squares of the person effect is (within rounding):

$$MS_P = \bar{V}_{PwSO} + \bar{C}_{PbS}(n_S - 1) + \bar{C}_{PbO}(n_O - 1) + (n_S - 1)(n_O - 1)\bar{C}_{PbSO} \tag{37}$$

For the data in Table 5 and consistent with Output 3:

$$MS_P = 9.53 + 2.89(3 - 1) + 8.96(2 - 1) + (3 - 1)(2 - 1)3.07 = 9.53 + 5.78 + 8.96 + 6.14 = 30.42$$

The variance component of the person effect has a simpler interpretation. As shown in Table 7, it is the degree to which the pattern of differences between persons is consistent across different situations and different observers:

$$\hat{\sigma}^2_P = \bar{C}_{PbSO} \tag{38}$$

For the data in Table 5 and consistent with Output 3:

$$\hat{\sigma}^2_P = 3.07$$

Note the parallel between this term and the $\hat{\sigma}^2_P$ term from the simpler designs. In those designs, the variance component for the person effect reflected the consistency of person differences across (different) situations. Here, it reflects the consistency of person differences across (different) situations and (different) observers.

Also as shown in Table 7, an intraclass correlation for the person effect is:

$$\hat{\rho}_{P\,relative} = \frac{\bar{C}_{PbSO}}{\bar{V}_{PwSO}} \tag{39}$$

For the data in Table 5:

$$\hat{\rho}_{P\,relative} = \frac{3.07}{9.53} = .32$$

This reflects the consistency of differences between persons (across situations), in relation to the typical degree to which persons differ from each other (within a situation). It represents the likely correlation between any two situations, where those situations were observed by different observers. Indeed, the average of the six correlations (between different observers and different situations) from Table 5 is $\bar{r}_{PbSO} = .32$. In the special case of equal person variances for each combination of situation and observer, $\hat{\rho}_{P\,relative}$ is exactly the average correlation between combinations of situations and observers (i.e., the expected degree of consistency between an observer from one situation and a different observer from a different situation):

$$\hat{\rho}_{P\,relative} = \bar{r}_{PbSO} \tag{40}$$


Turning to the situation effect, the same ideas apply. The situation effect, in part, hinges on variability in situations' effects on behavior (i.e., the degree to which behavioral assertiveness differs within the typical person as rated by a single observer). It also hinges on consistency—the degree to which the pattern of situational differences is consistent between (different) persons and (different) observers. For example, as shown in Table 7, the variance component for the situation effect equals the average covariance (across persons and across observers) of situational differences (see Table 2B):

$$\hat{\sigma}^2_S = \bar{C}_{SbPO} \tag{41}$$

For the data in Table 5 and consistent with Output 3:

$$\hat{\sigma}^2_S = 11.57$$

That is, it reflects the degree to which the pattern of differences among situations is consistent across (different) people, when those people are rated by different observers. For example, in Table 5, Observer A rates Allison as more assertive in the No Conflict situation than in the Peer Conflict situation, but most assertive in the Authority Conflict situation. Observer B's ratings of Brad's pattern of assertiveness are generally consistent with this—more assertive in the No Conflict situation than in Peer Conflict, but most assertive in the Authority Conflict situation. Thus, the pattern of differences among situations is consistent between different people as rated by different observers.
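The three consistency facets can be computed directly from the full Table 5 array. The sketch below (hypothetical data; the column labels and helper function are my own, not the chapter's) recovers $\bar{C}_{PbS}$, $\bar{C}_{PbO}$, and $\bar{C}_{PbSO}$, then reproduces Eq. (37) and Eq. (39):

```python
import numpy as np
from itertools import combinations

# Full Table 5 array: 6 persons x 3 situations x 2 observers (A = left, B = right)
y = np.array([
    [[8.1, 8.3], [7.3, 7.1], [15.0, 18.0]],
    [[9.3, 8.9], [2.5, 3.5], [10.0, 9.8]],
    [[12.6, 13.2], [11.6, 11.4], [11.4, 11.6]],
    [[14.0, 13.8], [9.7, 10.5], [18.2, 18.0]],
    [[14.5, 15.1], [4.5, 4.3], [14.1, 15.5]],
    [[10.2, 10.0], [4.3, 7.3], [13.3, 13.1]],
])
nP, nS, nO = y.shape
cols = {(s, o): y[:, s, o] for s in range(nS) for o in range(nO)}

def avg_cov(pairs):
    return np.mean([np.cov(cols[a], cols[b])[0, 1] for a, b in pairs])

pairs = list(combinations(cols, 2))
c_pbs = avg_cov([(a, b) for a, b in pairs if a[0] != b[0] and a[1] == b[1]])
c_pbo = avg_cov([(a, b) for a, b in pairs if a[0] == b[0] and a[1] != b[1]])
c_pbso = avg_cov([(a, b) for a, b in pairs if a[0] != b[0] and a[1] != b[1]])
v_pwso = np.mean([c.var(ddof=1) for c in cols.values()])

print(round(c_pbs, 2), round(c_pbo, 2), round(c_pbso, 2))  # 2.89 8.96 3.07
# Eq. (37): mean square for persons, and Eq. (39): relative ICC
ms_p = v_pwso + c_pbs*(nS-1) + c_pbo*(nO-1) + (nS-1)*(nO-1)*c_pbso
print(round(ms_p, 2), round(c_pbso / v_pwso, 2))           # 30.42 0.32
```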

Person-situation interaction as less consistency than would be possible, given measurement quality

Of the remaining terms in the model, the person × situation interaction is likely to be of most interest to researchers interested in personality dynamics. For (some) brevity's sake, a focus on the variance component of this term should convey its meaning. It builds on concepts similar to the person and situation effects themselves.

From the perspective of the person effect, the person × situation interaction can be seen as the difference between two consistencies. Of primary interest is a form of consistency noted earlier (see Table 2B)—$C_{PbSO}$, the degree to which the pattern of differences among persons is consistent between different situations and different observers. As just noted, larger values suggest that the people who are relatively assertive in one situation, as rated by one observer, are likely to be relatively assertive in different situations, as rated by different observers. The other form of consistency is $C_{PbO}$, which (also as noted earlier) reflects the degree to which the pattern of differences among persons is consistent between observers (within situations). Based upon these two consistencies, the variance component for the person × situation interaction is (see Table 7):

$$\hat{\sigma}^2_{PS} = C_{PbO} - C_{PbSO} \tag{42}$$

For the data in Table 5 and consistent with Output 3:

$$\hat{\sigma}^2_{PS} = 8.96 - 3.07 = 5.89$$

Thus, a large $\hat{\sigma}^2_{PS}$ indicates that cross-situational consistency is weaker than within-situation consistency (i.e., inter-rater reliability). A small $\hat{\sigma}^2_{PS}$ indicates that cross-situational consistency is relatively robust as compared to within-situation consistency. An intraclass correlation can be obtained for the person × situation interaction, as:

$$\hat{\rho}_{PS\,relative} = \frac{C_{PbO} - C_{PbSO}}{V_{PwSO} - C_{PbS}} \tag{43}$$

For the data in Table 5:

$$\hat{\rho}_{PS\,relative} = \frac{8.96 - 3.07}{9.53 - 2.89} = .89,$$

where $C_{PbS}$, as noted earlier, reflects consistency between situations within observers (i.e., the degree to which an observer's ratings of persons in one situation are consistent with her/his ratings of the persons in another situation). In the special case of equal variances within each combination of situation and observer:

$$\hat{\rho}_{PS\,relative} = \frac{r_{PwS} - r_{PbSO}}{1 - r_{PwO}} \tag{44}$$

Briefly, the person–situation interaction can also be framed in terms of the consistency of situational differences (see Tables 2B and 7):

$$\hat{\sigma}^2_{PS} = C_{SbO} - C_{SbPO} \tag{45}$$

For the data in Table 5:

$$\hat{\sigma}^2_{PS} = 17.46 - 11.57 = 5.89$$

Here, $C_{SbPO}$ is as defined earlier—a value reflecting the degree to which the pattern of differences among situations is consistent between persons and observers. $C_{SbO}$ is the typical degree to which the differences among situations are consistent between observers (within persons). This term, again, can be seen as akin to measurement reliability—the degree to which two observers agree on the way in which a given person's behavior changes across situations. From this perspective, a large $\hat{\sigma}^2_{PS}$ indicates that the consistency (between persons) of situational effects is weaker than within-person consistency (i.e., inter-rater reliability). In contrast, a small $\hat{\sigma}^2_{PS}$ indicates that situations affect different persons in a highly similar way, even as compared to what would be expected on the basis of inter-rater agreement.


Residual/error as related to consistency

As seen in the simpler designs, the residual/error term in this more complex design can be seen in terms of consistency. It can be represented in terms of various combinations of consistency terms; however, based upon terms articulated earlier (and as summarized in Table 2B), it is:

$$\hat{\sigma}^2_{residual} = V_{PwSO} - C_{PbS} - C_{PbO} + C_{PbSO} \tag{46}$$

For the data in Table 5 and consistent with Output 3:

$$\hat{\sigma}^2_{residual} = 9.53 - 2.89 - 8.96 + 3.07 = 9.53 - 8.78 = .75$$
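To make these covariance-based identities concrete, here is a minimal R sketch (R being the language of this handbook's tutorial materials) that computes the consistency terms and the resulting variance components from a hypothetical persons × (situation × observer) matrix. The simulated data, object names, and column labels are illustrative assumptions and will not reproduce the Table 5 values; the situation-side terms ($C_{SbO}$, $C_{SbPO}$) would follow from the same logic applied to covariances among situations.

```r
# Hypothetical data: 10 persons rated in 3 situations (S1-S3) by 2 observers (A, B).
set.seed(1)
X <- matrix(rnorm(10 * 6, mean = 5, sd = 3), nrow = 10)
colnames(X) <- c("S1.A", "S1.B", "S2.A", "S2.B", "S3.A", "S3.B")
sit <- sub("\\..*", "", colnames(X))  # situation label of each column
obs <- sub(".*\\.", "", colnames(X))  # observer label of each column

C <- cov(X)                      # variances/covariances among the 6 columns
same_sit <- outer(sit, sit, "==")
same_obs <- outer(obs, obs, "==")
off <- !diag(6)                  # mask for off-diagonal entries

V_PwSO <- mean(diag(C))                         # average within-cell person variance
C_PbS  <- mean(C[off & same_obs & !same_sit])   # between situations, within observers
C_PbO  <- mean(C[off & !same_obs & same_sit])   # between observers, within situations
C_PbSO <- mean(C[off & !same_obs & !same_sit])  # between situations and observers

sigma2_P        <- C_PbSO                           # Eq. (38)
sigma2_PS       <- C_PbO - C_PbSO                   # Eq. (42)
sigma2_residual <- V_PwSO - C_PbS - C_PbO + C_PbSO  # Eq. (46)
rho_P_relative  <- C_PbSO / V_PwSO                  # Eq. (39)
```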

Inferential statistics (yet again) are related to consistency

As in simpler designs, inferential statistics in the more complex design are related to consistency. For example, based upon terms defined earlier and summarized in Table 2B, the F-value for the person effect is (see Table 7)^d:

$$F_{person} = \frac{\hat{\sigma}^2_R + C_{PbS}(n_S) + C_{PbO}(n_O) + C_{PbSO}(n_S n_O - n_S - n_O)}{\hat{\sigma}^2_R + C_{PbS}(n_S) + C_{PbO}(n_O) - C_{PbSO}(n_S + n_O)} \tag{47}$$

For the data in Table 5:

$$F_{person} = \frac{.75 + 2.89(3) + 8.96(2) + 3.07(3 \cdot 2 - 3 - 2)}{.75 + 2.89(3) + 8.96(2) - 3.07(3 + 2)} = \frac{27.34 + 3.07}{27.34 - 15.35} = \frac{30.42}{12.01} = 2.53$$

Note that the numerator and denominator differ only with regard to $C_{PbSO}$ (consistency of person differences between situations and observers). If this consistency value is zero (i.e., there is no consistency, on average), then $F = 1.0$. However, if this consistency value is greater than zero, then the numerator increases and the denominator decreases, creating an $F > 1.0$. Thus, the F test of the person effect is a test of whether, in the population, the pattern of differences (among persons) is consistent between situations and observers.

^d This F test is again based on a model in which the person factor is treated as random, and it would apply regardless of whether the situation and observer effects are treated as random or fixed.

Alternative estimation procedures

This chapter has focused on the meaning of effects as estimated from ANOVA. This focus is important because most researchers are quite familiar with ANOVA, which has been used to conceptualize and analyze person-situation phenomena for decades (e.g., Bowers, 1973; Endler & Hunt, 1966; Leising & Müller-Plath, 2009). Further, several conceptual/analytic frameworks build upon this approach (e.g., Ozer, 1986; Shoda, 1999). However, there are alternative methods for estimating some terms discussed in this chapter. For example, multilevel models (also known as linear mixed models, etc.) can be used to estimate variance components for the types of data discussed in this chapter. To do this, such models (e.g., via "lmer" in R, "proc mixed" in SAS, the "mixed" procedure in SPSS) use estimation methods such as maximum likelihood or restricted maximum likelihood, rather than ANOVA-based methods. Although the results obtained from such alternative estimation procedures differ from the results obtained via the ANOVA estimation procedures, the differences are often slight. For example, here are variance component estimates for the "slightly more complex" design in Tables 5 and 6 (with persons and situations as factors), as obtained via three estimation methods:


                   ANOVA    ML     REML
Person              2.98    2.73    2.98
Situation          11.54    7.73   11.54
Person:situation    5.95    6.02    5.95
Error               0.64    0.64    0.64

Note that the estimates are generally quite similar across methods. In fact, the REML estimates are identical to the ANOVA estimates in this particular case (though this is not always so). Thus, the interpretations and meanings articulated in this chapter, in terms of ANOVA models, at least roughly apply to the results of those alternative estimation procedures as well.
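As a hedged illustration of those alternative procedures, the sketch below fits such a persons × situations design as a multilevel model via lme4 and extracts the variance components under REML and ML. The data frame `d` and its columns (person, situation, rating) are assumptions made for illustration, as is the availability of multiple ratings per person-situation cell (e.g., from different observers); this is not the chapter's own script.

```r
# A minimal sketch, assuming a long-format data frame "d" with factor
# columns person and situation and a numeric column rating, with more
# than one rating per person-situation cell.
library(lme4)

m_reml <- lmer(rating ~ 1 + (1 | person) + (1 | situation) +
                 (1 | person:situation), data = d, REML = TRUE)
m_ml <- update(m_reml, REML = FALSE)  # refit with maximum likelihood

as.data.frame(VarCorr(m_reml))  # REML variance components
as.data.frame(VarCorr(m_ml))    # ML variance components
```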

Summary

From a conventional perspective, person effects and situation effects are viewed in terms of the magnitude of differences among means. To what degree do situations elicit different levels of a behavior, as averaged across people? To what degree do persons exhibit different levels of a behavior, as averaged across situations? Although this view is conventional, it misses two important points that are likely highly relevant to personality psychologists working to integrate persons and situations.

One point is that all elements of ANOVA (and similar procedures) can be seen in terms of both variability and, importantly, consistency. In other words, all effects reflect—in one way or another—the degree to which differences exist in a single observation and the degree to which a pattern of differences is consistent across observations. Thus, variability and consistency are not polar opposites, but different sides of the same coin. For example, as illustrated earlier for a relatively simple case, the person effect (e.g., $\hat{\sigma}^2_P$) reflects both the degree to which people differ from each other in a situation and the degree to which the pattern of those differences is consistent across situations. Similarly, the situation effect ($\hat{\sigma}^2_S$) reflects the degree to which situations differ from each other in any given person and the degree to which the pattern of differences (among situations) is consistent across persons. Although the specific form and meaning of the effects change as the design and analysis change, the effects always reflect some blend of variability and consistency.

The second and related important point is that, to obtain large effects (at least large main effects of persons and situations), there must be both consistency and variability. For example, obtaining a large main effect of persons requires that the pattern of differences (among persons) is highly consistent across situations and that those differences are large in magnitude. Similarly, obtaining a large effect of situations requires that the pattern of differences (among situations) is highly consistent across people and that those differences are large in magnitude. If either variability is small or consistency is weak, there will be only small effects, at best.

Alongside these general points, the material in this chapter provides conceptual insights into specific results of likely interest to personality psychologists. For several common designs, this chapter articulates the psychological meaning—in terms of variability and consistency—of mean squares, eta squared effect sizes, variance components, and intraclass correlations. All of these are either commonly reported results, or have been proposed as the basis of general conceptual frameworks for personality psychology. A key goal was to articulate their meaning in terms of (hopefully) intuitive concepts reflecting variability and consistency. By doing so, this chapter is intended to enhance researchers' ability to analyze and interpret personological data in ways that are as psychologically informative as possible. Moving forward, it is likely worth considering particular analytic strategies that might be used to address specific types of questions, in light of the concepts discussed in this chapter.


An additional point worth noting is that person effects and situation effects are not necessarily "competitive." Consider again the simplest design described (see Table 1). In such a design, it is possible to have perfect consistency of both person effects and situation effects. That is, it is possible that the pattern of differences among people is perfectly consistent across situations and—at the same time—that the pattern of differences among situations is perfectly consistent across persons. Consistency in one sense neither requires nor implies inconsistency in the other sense; they are orthogonal. Thus, for researchers who are interested in the meaning, correlates, and consistency of patterns of differences, persons and situations are not "competitive." Moreover, even if one views the magnitude of differences as more important than the patterning of those differences, consistency is crucial. As this chapter has shown, when evaluating whether differences are systematic (or reliable, or not simply error), the key statistics make reference to the consistency of those differences. For example, as shown in Eq. (28), the F-test of the situation effect (in the simplest case above) is directly affected by consistency—the degree to which the pattern of differences among situations is consistent across people ($C_S$). To the degree consistency is greater than zero, the differences among situations are not merely random error variation. However, if the situational differences are large within each person but inconsistent between people, then those differences do not represent systematic, normative effects of situations.

Utility and implications

The material presented in this chapter is useful for a process-oriented personality psychology, for three reasons. First, a personality psychology that is process-oriented requires an integrated view of persons and situations (Furr & Funder, 2021). This integration must occur at both a theoretical and an analytic level. The fact that both person and situation effects


integrate components of variability and consistency means that questions of traditional interest to personality psychology (i.e., degree of consistency) and those of interest to situational psychology (i.e., magnitude of variability) are themselves integrated in driving key analytic effects. Both personality effects and situational effects can be seen in terms of variability and consistency.

A second way in which this chapter is relevant for a process-oriented personality psychology is that some integrative conceptual and analytic frameworks depend upon terms and concepts articulated in this chapter. For example, Ozer's (1986) framework for operationalizing consistency integrates persons and situations along with time and response classes. It emphasized variance components and generalizability coefficients (or intraclass correlations) as key statistical indices of interest. Moving in a quite similar direction, Shoda (1999) advocated a framework based upon variance components and intraclass correlations as reflecting person effects and person-situation effects. Also somewhat similarly, Leising and Igl (2007) argued that person effects should be indexed via a particular intraclass correlation derivable from variance components in a 2-situation (by multiple person) data box. By considering concepts that are fundamental to those frameworks and framing them in ways that may be relatively intuitive (in terms of simple variances and covariance/consistency), this chapter may enhance readers' understanding and appreciation for those ambitious and integrative frameworks.

A third way in which this chapter is relevant for a process-oriented personality psychology is that such a psychology likely requires a focus on patterns of situational effects (or cross-situational behavioral profiles; Furr, 2009a). The development of a psychology of situations and its integration into personality theory will require and stimulate conceptual, methodological, and analytic advances. One key area of advance involves the conceptualization of


situations in terms of variables that differ in degrees, rather than simply in terms of categories or types of situations (e.g., Rauthmann et al., 2014; Wagerman & Funder, 2009). Another advance is the measurement of such situational variables (e.g., Rauthmann & Sherman, 2016; Wagerman & Funder, 2009). Yet another key advance is in the ability to "observe" participants' behavior in a wide range of situations, and again advances are occurring (e.g., experience sampling procedures, Brown, Blake, & Sherman, 2017; Conner, Tennen, Fleeson, & Barrett, 2009; Furr, 2009b). Finally, the recognition that "patterning" of situational effects is indeed a key part of common statistical techniques may advance the analysis of integrative questions. For example, differentiating the typical magnitude of situational effects from the degree to which the pattern of situational effects is consistent (across people) is itself useful. Such a differentiation can reveal which of three psychological possibilities holds true for a given set of situational effects. Specifically, when examining a set of situations and their general effects on behavior, the pattern of those effects might be: (a) totally idiosyncratic and unpredictable (i.e., dramatically inconsistent across people and not systematically related to a personality characteristic), (b) inconsistent across people, but predictable (e.g., on the basis of some personality characteristic), or (c) highly consistent across people (e.g., thus perhaps so universal that it is unrelated to typical "individual difference" variables). Revealing which of these basic possibilities holds true is important, as it sets the stage for further questions (e.g., regarding the degree to which personality variables moderate the effect of situations on behavior).

A final point implicit in some of the previous points concerns the ambiguity inherent in the fact that many key concepts in ANOVA and similar methods blend variability and consistency. Because results such as mean squares (and typical ANOVA-based p-values), eta squared, and variance components reflect both variability and consistency, they may not reflect either clearly. Moreover, issues of variability and consistency are conceptually separable and are likely to have importantly different psychological implications. Thus, researchers should consider carefully which statistical values reflect their interests—those that do not differentiate variability and consistency, or those that do. An undifferentiated statistic, such as mean squares (and corresponding p-values), eta squared, or variance components, may indeed match some psychological questions. However, it seems likely that questions of variability and consistency are profitably separated from each other in many circumstances. Given that intraclass correlations focus squarely on consistency, as articulated in this chapter, it is reasonable that they are the basis of some frameworks for conceptualizing and operationalizing key phenomena in a dynamic, integrative personality psychology (e.g., Ozer, 1986; Shoda, 1999). With this basis, these frameworks do indeed satisfy their goals of focusing on consistency in personality and behavior in a way that facilitates a process-oriented personality psychology that integrates persons, situations, and their interactions.

References

Bem, D. J., & Allen, A. (1974). On predicting some of the people some of the time: The search for cross-situational consistencies in behavior. Psychological Review, 81, 506–520.
Block, J. (1968). Some reasons for the apparent inconsistency of personality. Psychological Bulletin, 70, 210–212.
Bowers, K. S. (1973). Situationism in psychology: An analysis and a critique. Psychological Review, 80, 307–336.
Brown, N. A., Blake, A. B., & Sherman, R. A. (2017). A snapshot of the life as lived: Wearable cameras in social and personality psychological science. Social Psychological and Personality Science, 8, 592–600.
Cattell, R. B. (1946). Description and measurement of personality. Oxford, England: World Book Company.
Cattell, R. B. (1966). The data box: Its ordering of total resources in terms of possible relational systems. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology (pp. 67–128). Chicago, IL: Rand-McNally.
Conner, T. S., Tennen, H., Fleeson, W., & Barrett, L. F. (2009). Experience sampling methods: A modern idiographic approach to personality research. Social and Personality Psychology Compass, 3, 292–313.
Endler, N. S., & Hunt, J. M. (1966). Sources of behavioral variance as measured by the S-R inventory of anxiousness. Psychological Bulletin, 65, 336–346.
Epstein, S. (1979). The stability of behavior: I. On predicting most of the people much of the time. Journal of Personality and Social Psychology, 37, 1097–1126.
Fleeson, W. (2007). Situation-based contingencies underlying trait-content manifestation in behavior. Journal of Personality, 75, 825–861.
Fleeson, W., & Noftle, E. (2008). The end of the person-situation debate: An emerging synthesis in the answer to the consistency question. Social and Personality Psychology Compass, 2, 1667–1684.
Furr, R. M. (2009a). Profile analysis in person-situation integration. Journal of Research in Personality, 43, 196–207.
Furr, R. M. (2009b). Personality psychology as a truly behavioural science. European Journal of Personality, 23, 369–401.
Furr, R. M. (2018). Psychometrics: An introduction (3rd ed.). Thousand Oaks, CA: Sage Publications.
Furr, R. M., & Funder, D. C. (2021). Persons, situations, and person-situation interactions. In O. P. John & R. W. Robins (Eds.), Handbook of personality: Theory and research (4th ed.). Guilford Press.
Leising, M. D., & Igl, W. (2007). Person and situation effects should be measured in the same terms. A comment on Funder (2006). Journal of Research in Personality, 41, 953–959.
Leising, M. D., & Müller-Plath, G. (2009). Person–situation integration in research on personality problems. Journal of Research in Personality, 43, 218–227.
Mischel, W. (1968). Personality and assessment. New York: Wiley.
Mischel, W., & Peake, P. K. (1982). Beyond déjà vu in the search for cross-situational consistency. Psychological Review, 89, 730–755.
Morse, P. J., Sauerberger, K. S., Todd, E., & Funder, D. C. (2015). Relationships among personality, situational construal, and social outcomes. European Journal of Personality, 29, 97–106.
Ozer, D. J. (1986). Consistency in personality: A methodological framework. Berlin: Springer-Verlag.
Rauthmann, J. F., Gallardo-Pujol, D., Guillaume, E. M., Todd, E., Nave, C. S., Sherman, R. A., et al. (2014). The situational eight DIAMONDS: A taxonomy of major dimensions of situation characteristics. Journal of Personality and Social Psychology, 107, 677–718.
Rauthmann, J. F., & Sherman, R. A. (2016). Measuring the situational eight DIAMONDS characteristics of situations: An optimization of the RSQ-8 to the S8*. European Journal of Psychological Assessment, 32(2), 155–164.
Rauthmann, J. F., Sherman, R. A., & Funder, D. C. (2015). Principles of situation research: Towards a better understanding of psychological situations. European Journal of Personality, 29, 363–381.
Shoda, Y. (1999). A unified framework for the study of behavioral consistency: Bridging person × situation interaction and the consistency paradox. European Journal of Personality, 13, 361–387.
Wagerman, S. A., & Funder, D. C. (2009). Situations. In P. J. Corr & G. Matthews (Eds.), Cambridge handbook of personality psychology (pp. 27–42). Cambridge: Cambridge University Press.
Wood, D., Gardner, M. H., & Harms, P. D. (2015). How functionalist and process approaches to behavior can explain trait covariation. Psychological Review, 122, 84–111.
Yao, Q., & Moskowitz, D. S. (2015). Trait agreeableness and social status moderate behavioral responsiveness to communal behavior. Journal of Personality, 83, 191–201.


C H A P T E R

29

Mobile sensing for studying personality dynamics in daily life

Gabriella M. Harari^a, Clemens Stachl^a, Sandrine R. Müller^b, and Samuel D. Gosling^c,d

^a Stanford University, Stanford, CA, United States
^b Columbia University, New York, NY, United States
^c The University of Texas at Austin, Austin, TX, United States
^d The University of Melbourne, Melbourne, VIC, Australia

O U T L I N E

Mobile sensing for capturing personality expression in daily life  765
Guidelines for studying personality dynamics using mobile sensing  768
Design factors to consider before launching a mobile sensing study  768
Description of tutorial materials  771
Analytic techniques to consider when working with sensor data  771
Describing within-person variability  772
Tutorial: Describing within-person variability in conversation behavior  772
Examining patterns over time  774
Tutorial: Examining change in conversation behavior over time  777
Identifying within-person dimensions  780
Tutorial: Identifying within-person components using a P-technique PCA  781
Conclusion  786
References  786

Abstract

People's thoughts, feelings, and behaviors can vary a great deal over situations and time. Such dynamic patterns are difficult to assess using traditional survey- and lab-based methods. However, new digital media technologies (e.g., smartphones, wearable devices, smart-home devices) offer a powerful approach to capturing fine-grained records of people's thoughts, feelings, and behaviors across occasions (i.e., across time and situations) via mobile sensors and metadata logs. Here, we introduce mobile sensing as a methodological approach for capturing personality dynamics in studies of daily life. To encourage researchers to use mobile sensing methods, we describe how sensing can be used to study personality and provide a set of guidelines for studying personality dynamics using mobile sensing. Specifically, we discuss procedural factors to consider before launching a sensing study and analytical techniques to consider when studying personality dynamics. We conclude by discussing the promise of mobile sensing for advancing research on personality dynamics.

Keywords: Mobile sensing, Sensors, Ambulatory assessment, Personality dynamics, Naturalistic observation

Personality research aims to describe, explain, and assess the characteristic patterns of thinking, feeling, and behaving that distinguish individuals from one another. A great deal of personality research has focused on personality traits, which summarize the degree to which persons systematically differ from one another. Personality traits predict consequential outcomes at individual (e.g., subjective well-being, physical health), interpersonal (e.g., quality of relationships with others), and institutional levels (e.g., occupational satisfaction, community involvement; Ozer & Benet-Martinez, 2006; Roberts, Kuncel, Shiner, Caspi, & Goldberg, 2007; Soto, 2019). In contrast, less is known about personality states—the variability within-persons in terms of thoughts, feelings, and behaviors (Fleeson, 2001). Most studies of personality states examine personality dynamics, or the degree to which a person varies in their states within the context of their daily lives. An understanding of within-person dynamics is important for understanding how personality findings may generalize from the aggregate (averaging across individuals) to the general level (for every individual; Hamaker, 2012). However, researchers have faced prohibitive methodological challenges when trying to capture, “…the behaviors of the person for an extended-enough period of

time to obtain a comprehensive distribution of how the person acts” (Fleeson & Gallagher, 2009; p. 2). Moreover, many studies of personality dynamics use experience sampling to obtain information about a person’s perceived personality state change (e.g., self-reports of momentary extraversion states), which may or may not map onto changes in expressed behaviors or surrounding contexts (e.g., engaging in conversation, spending time in social places). New digital media technologies (e.g., social media platforms, computers, smartphones, wearables, smart-home devices) are now paving the way for studies of personality dynamics in daily life. Digital media are beginning to address the methodological challenges of intensive longitudinal methodologies by permitting real-time, continuous assessment of people’s thoughts, feelings, behaviors, and situations. As part of their natural functioning, digital media contain a set of mobile sensors (e.g., accelerometers, microphones, GPS) and metadata logs (e.g., call and SMS logs, application usage logs) that can be used to record information about people and their surrounding contexts. For example, each time someone checks a social media account from their smartphone, time-stamped behavioral records are generated and stored in the metadata logs of both the social media platform and the device. Such digital records (or digital footprints) can be used to measure elements of an individual’s personality (Kosinski et al., 2013). Past research has used digital records from personal websites (e.g., Vazire & Gosling, 2004), personal blogs (e.g., Yarkoni, 2010), and Facebook profiles (e.g., Back et al., 2010; Youyou, Kosinski, & Stillwell, 2015) to examine how personality gets expressed in daily life, pointing to the promise of digital records as a powerful personality assessment tool. More recently, advances in mobile sensing have made it possible to use wearables, smartphones, and smart-home devices for the collection of behavioral and situational data longitudinally and in an ecologically sensitive manner (e.g.,


Gosling & Mason, 2015; Harari et al., 2016; Harari, Gosling, Wang, & Campbell, 2015; Lane et al., 2010; Miller, 2012). Mobile sensors operate imperceptibly, allowing for unobtrusive, naturalistic observational records that reduce the likelihood that participants will behave reactively (Craik, 2000; Mehl, Pennebaker, Crow, Dabbs, & Price, 2001; Mehl & Robbins, 2012; Rachuri et al., 2010). Mobile sensing is considered to be a form of ambulatory assessment (e.g., Goodwin, Velicer, & Intille, 2008; Wrzus & Mehl, 2015). Ambulatory assessment permits the collection of intensive longitudinal repeated-measures data in the context of people's daily lives, using self-report, observational, or physiological approaches (Trull & Ebner-Priemer, 2013; see this volume, Chapter 31 for a review). Generally, ambulatory assessment methods include a variety of self-tracking tools that participants can use for active logging (e.g., experience sampling surveys) and/or passive sensing (e.g., mobile sensors; Miller, 2012). A traditional form of active logging is experience sampling, which involves people reporting on their recent experiences in-the-moment to avoid some of the biases inherent in retrospective self-reports (Csikszentmihalyi & Larson, 2014). In this chapter, we focus on a set of passive sensing tools commonly referred to as mobile sensing, which can be used to objectively capture how personality is expressed in daily life. In practice, mobile sensing methods are often used in study designs that combine both self-report and observational ambulatory assessments to measure people's subjective experiences and objective information about their behavior and surrounding context, making it an ideal approach for capturing personality dynamics. Here, we introduce mobile sensing as a methodological approach to capturing personality dynamics as they play out in daily life. First, we describe how mobile sensing can be used to capture how personality is expressed in daily life. We then present a set of guidelines for


studying personality dynamics using mobile sensing. Specifically, we discuss procedural factors to consider before launching a sensing study and analytical factors to consider when working with sensor data. We focus on three promising directions for personality sensing research: describing within-person variability, examining patterns over time, and identifying within-person structural dimensions. A tutorial accompanies the set of guidelines to illustrate the suggested analytic techniques using data from the StudentLife study (Wang et al., 2014), which is a publicly available mobile sensing dataset. Our aim is to encourage the widespread use of mobile sensing for the assessment, description, and explanation of a wide range of personality-relevant behaviors and situations.

Mobile sensing for capturing personality expression in daily life

One framework for understanding personality assumes that personality consists of three fundamental elements known as the personality triad—the person, the behavior, and the situation (Funder, 2001, 2006). Theoretically, an understanding of any two elements of the triad should provide information about the third remaining element. For example, if we have knowledge about a person and the type of situation they are in, we might predict what type of behavior they will engage in (Funder, 2001, 2006). The convenience and efficiency of self-report methods have led to an abundance of research on the person element of the triad. That is, many studies focus on describing and identifying the constructs that characterize an individual's thoughts and feelings using self-report data. However, much less is known about how objective behaviors and situations manifest in the context of daily life (Baumeister, Vohs, & Funder, 2007; Funder, 2001, 2006; Furr, 2009).


Much of what we do know about people's behaviors and situations stems from proxies for objective measurement (e.g., retrospective or experience sampling self-reports)—not objective quantification of behaviors or situations as they naturally occur (e.g., Baumeister et al., 2007; Furr, 2009; Rauthmann, Sherman, & Funder, 2015). This state of affairs is problematic because self-report data have significant drawbacks: they can be disruptive and time consuming, and they are subject to expectancy effects, recall biases, memory limitations, and socially desirable responding (e.g., Paulhus & Vazire, 2007). Sensing methods address most of these drawbacks. Moreover, sensing methods can be used in multimethod studies that combine self-report and sensing methods to effectively capture all three aspects of the personality triad. Such multimethod study designs can offer insight about the mechanisms that produce everyday thoughts, feelings, behaviors, and situations (e.g., Baumeister et al., 2007; Furr, 2009).

The person element may be assessed in several ways, including via language use captured from text or microphone data, and via acoustic properties of speech or behavior captured from microphone data. For example, the person’s subjective thoughts and feelings can be passively sensed using text data collected from phone logs and microphones. Such data about a person’s everyday language patterns can offer insights into a range of psychological states (Tausczik & Pennebaker, 2010) and characteristics (e.g., Park et al., 2015; Schwartz et al., 2013; Yarkoni, 2010). Language patterns in text data collected from text messages, emails, social media posts, or keystrokes are naturally captured by digital media technologies, providing researchers with new ways of measuring personality through people’s natural language use (Boyd & Pennebaker, 2017). Another example is illustrated by research that aims to infer a person’s mental states based on the sound of their voice; such studies tend to examine microphone data to predict emotional states, such as stress from audio features (Lu et al., 2012) or from emotionally relevant behaviors recognized in the audio data (e.g., laughing, crying, arguing, and sighing; Dubey, Mehl, & Mankodiya, 2016). The second element of the personality triad—the behavior—may be assessed in many ways using passive sensing technologies (Harari et al., 2016). A recent review of the behavioral sensing literature provides an initial framework for organizing the lifestyle behaviors that can be measured using sensing methods: physical movement, social interactions, and daily activities (Harari, M€ uller, Aung, & Rentfrow, 2017). The physical movement behaviors typically measured are physical activity and mobility patterns. Physical activity refers to behaviors that describe movement of the human body, such as walking, running, and cycling, which are primarily measured using accelerometer and gyroscope sensors (e.g., Miluzzo et al., 2008). Mobility


patterns refer to behaviors that describe trajectories of human travel, which are typically measured using accelerometers, GPS, and WiFi network data (e.g., Eagle & Pentland, 2009; Saeb et al., 2015; Saeb, Lattie, Schueller, Kording, & Mohr, 2016). The social interaction behaviors typically measured are face-to-face encounters and computer-mediated communication. Face-to-face encounters refer to social interactions carried out in-person without a mediating technology, which are primarily measured using microphones and Bluetooth data (e.g., conversations inferred from microphone data; Chen et al., 2014; Lu, Pan, Lane, Choudhury, & Campbell, 2009; Mehl et al., 2001; Miluzzo et al., 2008). Computer-mediated communication refers to social interactions carried out through an electronic device, which are measured using data from smartphone application-use logs (e.g., text message and phone call frequencies inferred from phone logs; Boase & Ling, 2013; Chittaranjan, Blom, & Gatica-Perez, 2011; Eagle & Pentland, 2006; Harari et al., 2019; Montag et al., 2015; Stachl et al., 2017). Finally, daily activity behaviors typically measured can be categorized into mediated and nonmediated activities. Mediated activities refer to daily behaviors that are carried out using an electronic device, which are measured using smartphone application use logs (e.g., Abdullah et al., 2016; Mehrotra, Hendley, & Musolesi, 2016; Murnane et al., 2016). Nonmediated activities refer to behaviors that people engage in on a day-to-day basis that are not carried out through a device but are in close proximity to it, and which are typically measured using a combination of multiple types of sensor data (e.g., sleeping patterns inferred from a combination of light sensor and phone logs; Chen et al., 2013). The third element of the personality triad is the situation, which may be assessed primarily via information about situational cues (i.e., the


who, what, when, and where of a situation; Harari, Müller, & Gosling, 2018). For example, sensor data can reveal where a person spends time and what the surrounding environment is like (e.g., whether it is quiet or noisy, dark or light, isolated, or crowded). The Global Positioning System (GPS), Bluetooth radio, and WiFi sensors are the most relevant sources of location data for personality sensing research. These sensors provide situational information that places the individual in a physical location (inferred from GPS and/or WiFi data) and social context (inferred from Bluetooth and/or microphone data). Location data can capture the amount of time people spend in various places like their home, gym, or library (inferred from GPS and WiFi scans; e.g., Chon & Cha, 2011; Müller et al., 2017; Wang et al., 2014; Wang, Harari, Hao, Zhou, & Campbell, 2015). Such location data can be enriched by combining it with information from other sources, such as information about the type of place visited from Google Places, the visitor ratings of a place from Foursquare, or even economic ZIP-code level data from statistical bureaus. To date, few mobile sensing studies have directly focused on personality. The few existing studies have focused on between-person differences, either using sensor data to predict self-reported Big Five personality trait scores (e.g., Chittaranjan et al., 2011; de Montjoye, Quoidbach, Robic, & Pentland, 2013; Mønsted, Mollgaard, & Mathiesen, 2018; Schoedel et al., 2018) or using self-reported trait scores to predict sensor data (e.g., Ai, Liu, & Zhao, 2019; Stachl, Hilbert, et al., 2017). We know of only a couple of studies that have used sensing methods to examine personality dynamics or within-person variability in people's behaviors and situations (e.g., Wang et al., 2018). Thus, we next present a set of guidelines designed to facilitate the use of mobile sensing in personality dynamics research.


Guidelines for studying personality dynamics using mobile sensing

To encourage the use of sensing methods in personality dynamics research, here we describe the procedural and analytical factors to consider when conducting a personality sensing study. Specifically, we summarize the main design factors to consider before launching a personality sensing study (e.g., selecting sensing software, assembling a research team). We then describe three relevant analytic techniques for personality dynamics researchers suitable for use with sensor data: describing within-person variability, examining patterns over time, and identifying within-person structural dimensions. To illustrate the analytic techniques, we created an online tutorial for our chapter (the tutorial is available at our chapter's OSF page; osf.io/3x2ej) using publicly available data from the StudentLife Study (Wang et al., 2014; http://studentlife.cs.dartmouth.edu/). The overall aim of our guidelines is to facilitate the adoption of sensing methods by personality researchers interested in daily life. We provide an overview of the guidelines in Table 1.

Design factors to consider before launching a mobile sensing study

When designing a sensing study, several procedural and logistical factors must be considered. A comprehensive discussion of all the factors is beyond the scope of this chapter, so we direct interested readers to some broader recent reviews (e.g., Harari et al., 2016; Lane et al., 2010; Miller, 2012). Here we focus on five key steps that should be considered before launching a mobile-sensing study. The first step is to select the specific types of sensor data one wishes to collect. Researchers should determine the personality-relevant information they would like to measure and identify

the types of sensor data that can be used to make such inferences. For example, researchers might consult the existing personality literature to identify behaviors that should theoretically be related to personality constructs or they might consult the personality sensing literature to identify sensor data that have been used to measure personality constructs in past studies. Table 2 provides an overview of the types of sensor data that are commonly collected in sensing studies and maps them onto the components of the personality triad they measure. The second step to consider before launching a mobile-sensing study is the set of plans for ensuring data security and respecting participants’ privacy. For example, the level of data granularity must be weighed against privacy concerns. Moreover, different data-privacy regulations might be in place depending on where the study will be conducted. Such data regulations may require various technical and legal precautions for different types of sensor data. As of May 2018, for example, studies collecting or processing person-related data from citizens of the European Union (EU) are subject to tightened privacy requirements as a result of the EU General Data Protection Regulation (2018) (https://www.eugdpr.org/). Such requirements suggest voice data, GPS positions, app-usage records, and so on can be considered personal data and must be protected accordingly (e.g., via two-factor authentication, strong encryption). Thus, researchers should spend time considering what kinds of sensing data they need to answer their research questions, and at what level of granularity it will be collected. Moreover, participants should be well informed about the types of sensing data being collected from their devices. The third step to consider before launching a sensing study is the composition of the research team. That is, researchers should assemble a team with relevant skills to assist with study setup, data collection, and data analysis. In most cases, the research team will comprise

TABLE 1 Guidelines for studying personality dynamics using mobile sensing.

Study design factors to consider:

1. Selecting the types of sensor data to collect: Determine the personality-relevant information that needs to be measured (e.g., specific behaviors) and identify the types of mobile sensor and metadata that can be used to make such inferences.

2. Data security and participant privacy: Ethical and legal regulations can vary by institution and country, and should be explored to ensure participants provide informed consent and the research team uses responsible data practices (e.g., for collection, storage, and sharing of the smartphone data).

3. Composition of the research team: The skill sets of the individual members of the research team (e.g., computer and mobile app programming, data science techniques) will influence the way the study is set up, data collection is managed, and data analysis is approached.

4. Selecting sensing software for data collection: The selection of sensing software should be based on the composition of the research team and the type of data required to answer the research questions of interest.

5. Preregistration of study design and analysis plans: Preregistering studies can increase the replicability and credibility of research by restricting researcher degrees of freedom and questionable research practices.

Analytic techniques to consider:

1. Describing within-person variability: Smartphone data can be used to describe variability in personality expression to answer two broad research questions: (1) How much do people vary from one another (between-person) and themselves (within-person) in their everyday behavior? (2) What individual differences are associated with between- and within-person variability in everyday behavior?

2. Examining temporal patterns: Smartphone data can be used to examine personality change over time to answer three broad research questions: (1) On average, how do sensed behaviors change over time? (2) Is there individual variability in the observed change patterns over time? (3) Do individual differences predict patterns of change in sensed behaviors?

3. Identifying within-person structural dimensions: Smartphone data can be used to examine within-person dimensions to answer the following broad research question: Which sensed variables tend to group together over time within an individual?

psychologists or other social scientists, often with limited programming skills. In such cases, the researchers may elect to use commercial application software because the company would handle the backend system and manage the technical elements of data collection. If the researcher has sufficient funding, however, they may also elect to hire a programmer to assist with the technical setup of an open source sensing system. If the research team includes individuals with programming skills

(e.g., students with computer science backgrounds, or cross-disciplinary collaborations with colleagues in computer science departments), the researcher may elect to use a custom application. The fourth step to consider is the sensing software that will be used for data collection. The selection of sensing software should be guided by the composition of the research team and the type of data required to answer the research questions of the study. At the time of writing,

TABLE 2 Overview of the types of sensor data that capture components of the personality triad (the triad components captured by each sensor are shown in parentheses).

Accelerometer (Behaviors): Wang et al. (2014, 2015) and Rabbi et al. (2011)
Microphone (Persons, Situations, Behaviors): Dubey et al. (2016), Harari et al. (2019), Lu et al. (2009) and Wang et al. (2014, 2015)
Light sensor (Situations, Behaviors): Tseng et al. (2016) and Wang et al. (2015)
Bluetooth scans (Situations, Behaviors): Chen et al. (2014) and Eagle, Pentland, and Lazer (2009)
Global Positioning System (GPS) (Situations, Behaviors): Eagle and Pentland (2009) and Saeb et al. (2015)
WiFi scans (Situations, Behaviors): Chon and Cha (2011)
Metadata logs (Persons, Behaviors): Böhmer, Hecht, Schöning, Krüger, and Bauer (2011), Harari et al. (2019), Mehrotra et al. (2016), Montag et al. (2015), Murnane et al. (2016) and Stachl et al. (2017)

The final step to consider before launching a sensing study is whether (and how) the study design and analysis plans will be preregistered. To increase the replicability and credibility of research, there is general consensus that researchers should aim to preregister studies before any data are collected (Gonzales & Cunningham, 2015). Preregistration aims to restrict researcher degrees of freedom and prevent questionable research practices, such as HARKing (e.g., hypothesizing after the results are known; Kerr, 1998) and p-hacking (e.g., manipulating data collection or analyses to obtain significant results; Head, Holman, Lanfear, Kahn, & Jennions, 2015; Simmons, Nelson, & Simonsohn, 2011). Such practices lead to overfitting of models and to more positive statistical test results. Thus, before data collection begins, we recommend researchers preregister their plans at a suitable website (e.g., www.osf. io, https://aspredicted.org/). A proper preregistration usually includes detailed information about research questions, hypotheses, planned sample characteristics, collected variables, study


design, and analytic plan. To facilitate preregistration of sensing studies, we have provided a preregistration template specifically designed for sensing studies on this chapter’s OSF page (https://osf.io/3x2ej/).

Description of tutorial materials

The tutorial materials illustrating the following analytic techniques can also be found on this chapter's OSF page (https://osf.io/3x2ej/). The tutorial materials include a series of R scripts that will permit the reader to follow along with the guidelines discussed in the chapter. The R scripts include additional comments to assist the reader through the data processing and analytic steps using data from a publicly available sensing dataset from the StudentLife Study (Wang et al., 2014). StudentLife was one of the first studies to use passive sensing data to assess the mental health, academic performance, and behavioral trends of a cohort of 48 students over the course of a 10-week academic term. Detailed information about the StudentLife study and the dataset are available online at the project's website (http://studentlife.cs.dartmouth.edu/dataset.html). The StudentLife dataset includes many types of sensor data, but here we focus on microphone, accelerometer, and Bluetooth data. We used data from these three sensors to illustrate some behavioral and situational inferences particularly relevant to personality. First, we used data from the microphone sensor, focusing on conversation behavior inferences (conversation duration and conversation frequency) and acoustic environment inferences (whether the device is around silence, noise, or voices). The audio classifier distinguished between silence, voice, noise, and unknown acoustic features from the microphone data. The classifier used duty cycling to run on the phones, which means that audio inferences were made for 1 min and paused for 3 min. If the


classifier detected voices, it would continue to run until the end of the conversation. Second, we used data from the accelerometer sensor. The accelerometer data provided information about the type of activity a participant was engaging in at the time of recording. The activity classifier distinguished between stationary, walking, running, and unknown. In a similar fashion to the audio classifier, activity was recorded using continuous duty cycling (i.e., activity inferences are made every 3 min for 1 min each). Third, we used data from the Bluetooth sensor. The Bluetooth data consisted of the number of Bluetooth devices around the participant, which were identified via Bluetooth scans every 10 min. Bluetooth scans are often used as a proxy for the presence of other colocated people in a location (e.g., Chen et al., 2014).
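These classifier outputs arrive at the minute level; as a hedged preview of the aggregation step discussed in the next section, the following R sketch rolls detected conversation episodes up into person-day measures. The object `conv` and its columns (uid, start, end) are hypothetical stand-ins, not the actual schema of the StudentLife files.

```r
# A minimal sketch, assuming a data frame "conv" with one row per detected
# conversation: uid (participant ID) plus start and end as POSIXct timestamps.
library(dplyr)

daily_conv <- conv %>%
  mutate(day = as.Date(start),
         minutes = as.numeric(difftime(end, start, units = "mins"))) %>%
  group_by(uid, day) %>%
  summarise(duration  = sum(minutes),  # total minutes in conversation per day
            frequency = n(),           # number of conversations per day
            .groups = "drop")
```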

Analytic techniques to consider when working with sensor data

Once the data have been collected, there are several steps to consider when processing and analyzing mobile sensing data. One of the first data-processing steps involves creating (or finding pre-existing) data processing scripts to extract the personality-relevant inferences from the sensor data (see Harari et al., 2017 for a review of classifiers used to infer behaviors). After processing the data to extract the behavioral inferences, the continuous data must be aggregated to an appropriate unit of analysis (e.g., converting minute-by-minute inferences to daily-level estimates). The selection of an appropriate unit of analysis should be guided by the research questions of the study (e.g., whether time-series or aggregated measures are needed). If aggregation is selected, consideration should be given to understanding the extent to which the aggregate measures are stable (or consistent) and reliable over time. Once the behavioral inferences have been aggregated, the researcher should examine the estimates to ensure the appropriate modeling


strategies are adopted. For example, in deciding the modeling approach to be used, the following factors should be considered: (1) the unit of analysis of any outcome variables of interest (e.g., one-time vs. repeated measures), (2) the variability due to between- (i.e., individuals) and within-person (i.e., time) factors, (3) autocorrelation among the behavioral measurements, and (4) the cyclical nature of the behavioral measurements. We present now three analytic techniques that personality dynamics researchers should find helpful when working with sensor data. Specifically, we describe how one might use sensor data to describe within-person variability, examine temporal patterns, and identify within-person structural dimensions.
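As a quick, hedged illustration of checking factors (3) and (4) for a single participant (again assuming the hypothetical daily_conv aggregates sketched earlier, not the chapter's OSF scripts), one can inspect the autocorrelation function of that participant's daily series; a spike at lag 7 would suggest a weekly cycle.

```r
# Autocorrelation of one participant's daily conversation duration.
one_student <- subset(daily_conv, uid == daily_conv$uid[1])
one_student <- one_student[order(one_student$day), ]
acf(one_student$duration, lag.max = 14)  # e.g., a lag-7 peak suggests weekly cyclicity
```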

Describing within-person variability

Sensing methods provide continuous information about people's behavior, allowing for the estimation of density distributions (Fleeson, 2001). The density distributions approach "…involves observing people as they conduct their daily lives and measuring a large number of their behaviors in a manner that allows their similarity to be assessed" (p. 84). If a person exhibits high amounts of within-person variability, that means the person behaves very differently from measurement occasion to occasion (e.g., hour to hour, day to day, week to week); if a person exhibits low amounts of within-person variability, that means the person behaves very similarly from measurement occasion to occasion. In general, the higher the amount of variability a person exhibits, the less useful trait labels are when describing the person, because they provide a less accurate description of the person's behavior at any one time. Variability analyses with sensing data can reveal the descriptive patterns of variability in behaviors between and within individuals over time, and psychological correlates of the observed density distributions can be examined.

Such analyses address two broad research questions: (1) How much do people vary from one another (between-person) and from themselves (within-person) in their everyday behavior? (2) What individual differences are associated with between- and within-person variability in everyday behavior?

Tutorial: Describing within-person variability in conversation behavior

The first step in working with sensor data, as with other types of quantitative data, is to carefully investigate the basic descriptive statistics (e.g., mean, standard deviation, and skewness) of the variables of interest. The descriptive properties of the dataset will aid researchers in determining how best to model the variables of interest to answer their research questions. To prepare the conversation data for analyses examining within-person variability across days of the study, we transformed the continuous conversation estimates into daily aggregates. Specifically, we created 67 day variables, each representing the within-person sum duration of conversation sensed per 24-h day of the study. We then examined the variance in the conversation behavior data to determine whether most of the variability stems from between-person factors (different persons) or within-person factors (different days). To do so, the between- and within-person variation were plotted alongside the total variation to identify whether students showed greater between- or within-person variability in their daily conversation behaviors. When most of the variance is due to between-person factors (i.e., individuals), a two-level modeling approach (i.e., with fixed and random effects) likely provides an appropriate model. When most of the variance is due to within-person factors (i.e., time), the addition of a residual covariance structure provides a more appropriate model. Fig. 1 shows that for both conversation behaviors, there was greater within-person variation than between-person variation. These findings suggest that the students in the sample tended to differ from themselves in their conversation behaviors over time more than they differed from one another. They also suggest that a two-level modeling approach with fixed and random effects would likely provide an appropriate model, possibly with the inclusion of additional residual covariance.

FIG. 1 The amount of within- and between-person variance in the conversation behaviors of 49 students across 67 days of an academic term.
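The variance decomposition itself can be sketched with a null (intercept-only) multilevel model; lme4 is one option, and daily, id, and duration are the hypothetical names from the aggregation sketch above:

library(lme4)

m0 <- lmer(duration ~ 1 + (1 | id), data = daily)  # intercept-only model
vc <- as.data.frame(VarCorr(m0))                   # variance components
between <- vc$vcov[vc$grp == "id"]                 # between-person variance
within  <- vc$vcov[vc$grp == "Residual"]           # within-person (day-to-day) variance
between / (between + within)                       # intraclass correlation (ICC)

A small ICC, as in Fig. 1, indicates that most of the variability lies within persons.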

Next, we examined the density distributions for students' daily conversation behaviors. To do so, we numerically summarized the daily-level estimates using the interindividual [M] and intraindividual [iM] mean, the interindividual [SD] and intraindividual [iSD] standard deviation, the interindividual [Skew] and intraindividual [iSkew] skew, and the interindividual [Kurtosis] and intraindividual [iKurtosis] kurtosis. Table 3 presents a summary of the density distribution statistics for the daily conversation durations and frequencies.

TABLE 3 Density distribution estimates for daily conversation behaviors.

            Duration                              Frequency
            Interindividual   Intraindividual     Interindividual   Intraindividual
M                             10.77                                 28.44
SD          3.17              13.80                8.56             13.25
Min         4.06              1.25                 11.12            3.43
Max         19.20             57.31                60.97            65.16
Range       15.14             56.06                49.85            61.73
Skew        .54               1.98                 1.03             .37
Kurtosis    .13               4.57                 2.83             .34
Std. Error  .45               2.98                 1.22             1.81

Note: N = 49. Number of days per user ranged from 22 to 67. Interindividual estimates present between-person statistics; intraindividual estimates present within-person statistics. The interindividual M is the same as the intraindividual M, so it is omitted from the table.

To visualize the density distributions, we created density plots illustrating the usual level of a person's behavior (and their within-person variation) as well as the differences between people. Fig. 2 presents the density plots for conversation duration and frequency across the days of the study for two students. The plots show the variability in conversation behavior between persons (by comparing the shapes of the two students' distributions) and within persons (by comparing the widths of each student's distributions).

FIG. 2 Density distributions for two students' daily conversation behaviors: (A) conversation duration distributions, (B) conversation frequency distributions.

As a final step, we examined individual differences related to the density distribution estimates. Specifically, we correlated the density distribution estimates with self-reported Big Five trait ratings to identify relationships between the observed conversation variability and personality constructs. To avoid redundancy, the individual-difference analyses do not include the interindividual mean estimates [M]. Table 4 presents the correlations between the conversation behaviors and the self-reported Big Five personality traits. Given the small sample size and the size of the effects, none of the findings are statistically significant; we nonetheless present them here to illustrate the analytic technique. Such analyses point to the promise of combining self-report data with sensor data, but they require large-sample studies to establish the reliability of observed patterns.

TABLE 4 Correlations between daily conversation behaviors and personality traits.

Conversation behavior   Openness   Conscientiousness   Extraversion   Agreeableness   Neuroticism
Duration
  Correlation           .03        .15                 .04            .05             .03
  P-value               .85        .32                 .77            .71             .82
  .95 lower CI          -.25       -.20                -.34           -.26            -.27
  .95 upper CI          .29        .44                 .30            .35             .35
Frequency
  Correlation           .01        .06                 .08            .18             .07
  P-value               .94        .71                 .61            .24             .65
  .95 lower CI          -.30       -.27                -.23           -.14            -.39
  .95 upper CI          .30        .36                 .38            .46             .24

Note: N = 47. Spearman's correlations were computed between Big Five trait ratings and the intraindividual mean estimates of the daily conversation behaviors. None of the correlations were statistically significant at P < .05.
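A hedged sketch of these computations using dplyr and the psych package; daily is the hypothetical daily-aggregate data frame from above, and traits (one row per person, one column per Big Five trait) is likewise a stand-in:

library(dplyr)
library(psych)

# Intraindividual (within-person) density distribution estimates
dd <- daily %>%
  group_by(id) %>%
  summarise(iM = mean(duration), iSD = sd(duration),
            iSkew = skew(duration), iKurtosis = kurtosi(duration))

describe(dd[, -1])  # interindividual summary of the person-level estimates

# Spearman correlation between a density distribution estimate and a trait
merged <- left_join(dd, traits, by = "id")
cor.test(merged$iM, merged$extraversion, method = "spearman")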

Examining patterns over time

A second technique researchers may want to consider examines how personality expression changes over time. Thus, our next set of analyses examines longitudinal patterns of change in sensed behaviors and identifies predictors associated with the observed change patterns. Such analyses aim to answer three broad research questions: (1) On average, how do sensed behaviors change over time? (2) Is there individual variability in the observed change patterns over time? (3) Do individual differences predict patterns of change in sensed behaviors?

In practice, the longitudinal modeling technique should be selected after inspecting the features of the sensing data to be modeled. That is, the modeling technique should be selected in a data-driven way to ensure that the model provides optimal estimation of the observed patterns of change, given the qualities of the dataset. For example, personality researchers might consider multilevel and time-series models appropriate when modeling nested sensor data (Cheng, Edwards, Maldonado-Molina, Komro, & Muller, 2010). To assist in the selection of an appropriate model, researchers should examine the given behavior to determine the characteristics of the sensor data (e.g., whether the observations are independent or show cyclical patterns). Time-series approaches to consider include (1) parametric methods, which assume that the stationary stochastic process has a structure that can be described using several key parameters, and (2) nonparametric methods, which do not assume that the stochastic process has an underlying structure. Parametric time-series methods include the autoregressive and moving-average models (and combinations of these), which seek to estimate the parameters of a model describing a stochastic process. Nonparametric methods, on the other hand, seek to estimate the covariance of the process directly. To illustrate the time-series nature of the daily-level data, Fig. 3 presents the patterns in daily average conversation behaviors across the 67 days of the study.

FIG. 3 Average daily conversation behaviors for 49 students across 67 days: (A) conversation duration and (B) conversation frequency. Vertical bars indicate standard errors of the mean.

Growth models are a commonly used technique for modeling longitudinal development in variables over time (Grimm, Ram, & Estabrook, 2016). Growth models were traditionally used to model a few time points for many participants, but they can also serve as a framework for fitting large datasets with many participants and many time points, making them particularly useful for examining the personality dynamics captured by sensor data. For example, a person might exhibit weekly patterns in extraverted behavior (lower amounts of social interaction during the work week and higher amounts during the weekends), which could be captured in the social interactions going on around the person and those mediated through the device itself. The simplest growth model is the linear growth model. Linear growth models can be fitted using several different frameworks, such as the multilevel modeling or structural equation modeling framework. Here we adopt the multilevel modeling approach to model growth in the conversation behavior patterns. Multilevel models to consider include (1) linear mixed models for normally distributed data and (2) generalized or nonlinear mixed-effects models for non-normal data (e.g., Poisson, exponential). The models considered should include both fixed (mean) and random (covariance) effects: the fixed effects describe the mean model for the sample, while the random effects account for the variance between persons, and the covariance patterns account for the variance within persons. To fit the most appropriate multilevel model, the analyses should start with a model of the fixed effects for both intercepts and slopes. Subsequent models should be compared in an iterative fashion using fit statistics (e.g., AIC, BIC). The researcher would then proceed to fit (1) random intercept models, in which intercepts vary and slopes are fixed, (2) random slope models, in which intercepts are fixed and slopes vary, and (3) random intercept and slope models, in which both intercepts and slopes vary. During the analyses, collinearity and assumption diagnostics should be assessed. If collinearity is an issue, redundant variables may be dropped from the analysis or dimension-reduction techniques may be used to reduce it. If the behavior data show evidence of autocorrelation or cyclical behavior over time, time-series models should be considered.
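To make the iterative model-building concrete, here is a minimal sketch using lme4; the data frame weekly and its columns (id, week, duration, extraversion) are hypothetical stand-ins, the chapter's own analyses may well have used other tooling (e.g., nlme), and the Model 5 analog (an additional random effect for extraversion) is omitted for brevity:

library(lme4)

# REML = FALSE (maximum likelihood) so AIC/BIC comparisons across fixed effects are valid
m1 <- lmer(duration ~ 1 + (1 | id), data = weekly, REML = FALSE)                       # Model 1: null
m2 <- lmer(duration ~ week + (1 | id), data = weekly, REML = FALSE)                    # Model 2: + L1 fixed effect
m3 <- lmer(duration ~ week + extraversion + (1 | id), data = weekly, REML = FALSE)     # Model 3: + L2 fixed effect
m4 <- lmer(duration ~ week + extraversion + (week | id), data = weekly, REML = FALSE)  # Model 4: random week slope
AIC(m1, m2, m3, m4)  # smaller values indicate better fit
BIC(m1, m2, m3, m4)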

Tutorial: Examining change in conversation behavior over time

Do the college students in our sample experience the same behavior changes, or do individual differences exist in the direction or rate of change in their conversation behavior trajectories? To examine behavior-change patterns, this set of analyses models longitudinal patterns in the sensed conversation behavior. Overall, the goal of the analyses was to identify any normative patterns of change in the behaviors throughout the semester, and whether there was individual variation around the normative pattern. We also examined the extent to which self-reported individual differences predicted the observed change patterns. If the students' conversation behavior patterns were affected by normatively shared college experiences (e.g., exam periods), we would expect to see mean-level changes in the students' average behavior trajectories over time. The behavior trajectories may also show individual-level change as a result of unique experiences (e.g., not being able to make new friends) or unique reactions to the college experience (e.g., staying at home and studying vs. going out and partying). If this were the case, we would expect to see significant variability around any mean-level trajectories identified for the conversation behavior. Variability in students' behavioral trajectories seems likely given that previous research has found students differ from one another in their behavior depending on demographic (e.g., academic class; McArthur & Raedeke, 2009; Tsai & Li, 2004) and personality characteristics (e.g., Friedman, 2000; Smith, 2006).

We prepared the data for growth modeling analyses by organizing the estimates in a nested format, with conversations by week (level 1) nested within participants (level 2). We aggregated our conversation estimates to the weekly level to permit estimates of week-by-week change in the conversation behaviors across the 10-week term. Specifically, we created 10 variables per participant, representing the sum duration and frequency of conversations for each week of the study. As an initial exploratory step, the conversation behavior patterns were visually explored for trends. Such plots assist in determining whether a growth model is appropriate for representing the patterns observed in the data. Fig. 4 displays individual trajectories in the duration and frequency of conversations for a randomly selected subsample of 10 participants. The figure suggests that there is variation in both the intercepts and slopes over time. In addition, plots of the means over time were used to visually identify average-level trends in the behaviors (e.g., linear, quadratic, cubic). Fig. 5 presents the mean trajectories of the conversation behavior across weeks. We used these mean plots to guide the selection of the mean structure trend to be modeled for the conversation behavior. For the purposes of our illustration here, we use linear growth models to analyze weekly trends, with a focus on conversation duration.

FIG. 4 Plots of the mean weekly trends in conversation behavior for 10 participants: (A) conversation duration and (B) conversation frequency.

FIG. 5 Average (A) weekly conversation duration and (B) conversation frequency across 10 weeks. Number of students with data for each week ranges from 38 to 49. Vertical bars indicate standard errors of the mean.


Table 5 presents the results for five iterative growth models capturing change in participants' average weekly conversation duration. We followed the iterations presented in Finch, Bolin, and Kelley (2014) by first fitting intercept-only models and then random-coefficient models. Within each group of models, we show both a model with a level 1 predictor (in our case, time as the semester week) and a model with an added level 2 predictor (in our case, participants' self-reported extraversion score). Each model extends the previous model in an attempt to improve the prediction of average weekly conversation duration. To begin, we created a null model (Model 1), which contained no independent variables and allowed only the intercept to vary (as a function of the included participants). The null model characterizes the structure of the data and serves as a baseline for comparing more complex models. The AIC (2362.47) and BIC (2374.73) values are of primary interest. Moreover, the ICC (not presented in the table) for this model is .4790, indicating that approximately 48% of the variance is attributable to between-person differences in average weekly conversation duration, underlining the importance of using a multilevel model.

TABLE 5 Summary of five iterative growth models capturing linear change in weekly conversation duration.

                                       Model fit                     Fixed effects, B (df) P                              Random effect (SD)
Models                                 AIC      BIC      logLik      Intercept        Week             Extraversion      Intercept  Week  Extraversion

Intercept-only models
Model 1: Null model                    2362.47  2374.73  -1178.24    10.67 (394) .00                                     2.95
Model 2: Model 1 + Fixed effect (L1)   2341.63  2357.97  -1166.82    12.11 (393) .00  -.27 (393) .00                     2.98
Model 3: Model 2 + Fixed effect (L2)   2342.52  2362.94  -1166.26    9.93 (393) .44   -.27 (393) .00   .12 (45) .87      3.01

Random coefficient models
Model 4: Model 3 + Random effect (L1)  2331.31  2355.82  -1159.66    12.15 (393) .00  -.28 (393) .00                     3.81       .36
Model 5: Model 4 + Random effect (L2)  2338.22  2379.04  -1159.11    8.71 (393) .48   -.28 (393) .00   .18 (45) .79      3.83       .36   .00

Note: N = 47 participants with K = 441 daily values. B = unstandardized regression coefficient; df = degrees of freedom; P = P value. Models shown predict participants' average weekly conversation duration. Intercept-only models let the intercept vary across persons, while random coefficient models let both the intercept and the slope vary across persons. L1 indicates the inclusion of a Level 1 variable in the model (in our case, time as the semester week), and L2 indicates the inclusion of a Level 2 variable in the model (in our case, a participant's extraversion score). Model fit indices: AIC, Akaike information criterion; BIC, Bayesian information criterion; logLik, log-likelihood.

Next, we added an independent variable (in our case, time, represented as semester weeks 1-10) to the model. Because smaller AIC and BIC values reflect better model fit, and the AIC and BIC values for Model 2 are smaller than those of Model 1, we concluded that Model 2 provided a better fit to the data. In Model 3, we added extraversion as another predictor of conversation duration (at the level of the participant); this model showed a worse fit than Model 2. Next, we moved on to random-coefficient models, which allow both the intercept and the slope to vary for each included participant. A better model fit emerged when letting both the intercept and the slope for the time effect vary (Model 4), while a poorer fit emerged when additionally letting the slope for extraversion vary (Model 5), suggesting that Model 4 is the best-fitting model.

Overall, the growth model results show that time is significantly related to the average duration of conversation (P < .001). More specifically, the negative coefficient indicates that students had shorter conversations as the semester progressed. Moreover, the random-effect standard deviation for the time slope (.36) is much larger than that for extraversion (.00), indicating that the relationship between time and average conversation duration varies more across participants than does the relationship between extraversion and average conversation duration.
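Continuing the hypothetical lme4 sketch above, these quantities can be read directly off the fitted object:

summary(m4)  # fixed effects: intercept and the negative week slope
VarCorr(m4)  # random-effect SDs for the intercept and the week slope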

Identifying within-person dimensions

A third analytic technique for personality researchers working with sensor data to consider is the identification of within-person dimensions. To illustrate this approach, consider the three-dimensional Cattell data box in Fig. 6, which represents the characteristics of datasets with regard to persons (P), occasions/situations (S), and time (T; Molenaar & Campbell, 2009).

FIG. 6 The Cattell data box, spanning variables, persons/entities, and occasions; interindividual analysis (R-technique) operates across persons, intraindividual analysis (P-technique) across occasions. Reprinted from Molenaar, P. C., & Campbell, C. G. (2009). The new person-specific paradigm in psychology. Current Directions in Psychological Science, 18(2), 112-117. doi:10.1111/j.1467-8721.2009.01619.x.

Several methods exist for the analysis of intraindividual variability, as represented by the P-dimension of the box (Cattell, 1952). For example, while growth modeling focuses on the analysis of differences in intraindividual change across individuals, the so-called P-technique can be used to identify intraindividual structural dimensions among repeatedly measured variables using factor analysis. Thus, our next set of analyses examined within-person dimensions in sensed data. Such analyses aim to answer the following broad research question: Which sensed variables tend to group together over time within an individual?

The P-technique approach is useful for investigating personality under the assumption of dynamic structures within a single subject. Methodologically, the classic P-technique is equivalent to a standard factor analysis (R-technique, performed on a correlation matrix), assuming underlying latent constructs. In both analyses, correlational structures in the data are used to group variables and consequently reduce their number. Whereas factor analysis is commonly used to analyze multiple variables from multiple subjects at one point in time, for personality dynamics research it is often more relevant to investigate multiple variables from a single person, measured repeatedly over time (Molenaar & Nesselroade, 2009). Findings from such analyses should prove useful as therapies, services, and products become increasingly personalized to individuals' personalities. For example, psychotherapists could investigate intraindividual relationships between a patient's behaviors (sensed via smartphones) and their self-reported state anxiety scores (collected via experience sampling). Because there can be a myriad of sensed behaviors, it is often sensible to reduce them to a few underlying dimensions. Such a reduction approach might also reveal behavioral dimensions that, specific to an individual, can be associated with responses on the anxiety scale. For example, state anxiety could be measured repeatedly with experience sampling methods and then correlated with the extracted behavioral dimensions. By doing so, person-specific variability in behavioral dimensions (or single behaviors) could be associated with variability in self-reported state anxiety scores. Furthermore, such patterns may be highly specific to single individuals, making it likely that they would be averaged out in more common interindividual analyses.

In general, the analytical approach depends on whether the researcher has prior expectations about the number and composition of the potentially underlying dimensions (confirmatory as opposed to exploratory). The classical factor analysis technique can be used in a confirmatory fashion if the correlational structures between the variables of interest are assumed to be caused by one or more known common latent variables (e.g., behaviors and personality items). In this regard, factor analysis aims to group the common variance of all variables into a reduced number of factors. However, if no underlying latent variable structure is assumed (as in the present exploratory analyses) and the researcher simply aims to obtain a smaller number of behavioral dimensions that capture as much of the variance as possible, a basic principal component analysis (PCA) can be used for dimensionality reduction. Thus, here we present a PCA to group various sensed behavioral and environmental variables over time.
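As a small illustration of the data structure the P-technique requires, the following sketch builds one participant's days-by-variables matrix; the long-format data frame events (columns id, day, variable, freq) and the participant identifier are hypothetical:

library(dplyr)
library(tidyr)

x <- events %>%
  filter(id == "participant_01") %>%              # keep a single person's records
  pivot_wider(id_cols = day, names_from = variable,
              values_from = freq) %>%             # one column per sensed behavior
  arrange(day) %>%
  select(-day)                                    # days x variables matrix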

Tutorial: Identifying within-person components using a P-technique PCA

We used behavioral records from the StudentLife dataset to fit a within-person PCA, focusing on data collected from the microphone, accelerometer, and Bluetooth scans. To fit the PCA, we created a separate dataset containing the frequency of each variable, aggregated for each day of the study. In Fig. 7, the time-series data are presented for all included variables, plotting the observations over the days of the study. Note that, because of the large differences in event frequency, events with similar frequency ranges are plotted in separate panels.

FIG. 7 Multivariate time series of audio, physical activity, and Bluetooth logs. Panels group variables with similar frequency ranges: freq_stat, freq_silence, freq_voice, and freq_noise; freq_walk, freq_run, and freq_blue; and freq_conv.

A prerequisite for dimension-reduction techniques is the presence of substantial correlations between the observed variables (about .30; Tabachnick & Fidell, 2013). Thus, we first calculated bivariate correlations and investigated scatterplots. This step reveals whether linear statistical relationships exist in the data and, consequently, whether a factor analysis will be helpful. Additionally, a series of procedures can be performed to determine the feasibility of dimensionality reduction (e.g., the KMO measure of factoring adequacy). These procedures were developed when dimensionality reduction was done by hand, to help researchers make sure they spent their time on the calculation of appropriate factor solutions. However, they are mostly of historical relevance today, given the speed of current computation for dimension-reduction analyses, so we do not report them in this chapter (we point interested readers to Bartlett, 1951; Cerny & Kaiser, 1977; and Grice, 2001, for more detailed discussion of these techniques).

Fig. 8 shows pairwise correlations between all variables included in the PCA. The results revealed substantial correlations between the variables. Specifically, the absolute values of the correlations ranged from .00 to .90, with an absolute mean intervariable correlation of .23. Furthermore, some variables were clearly correlated (e.g., freq_conv with freq_stat) and others were not (e.g., freq_stat with freq_run).

FIG. 8 Panel plot of all variables used in the PCA.

One of the most crucial steps in a factor analysis (and one of the most debated) is the identification of the right number of dimensions. Multiple approaches to this step exist; however, no universally agreed-upon method has been identified (Fabrigar, Wegener, MacCallum, & Strahan, 1999). Originally, Cattell proposed the visual scree-test criterion, in which a marked drop in eigenvalues serves as the decision criterion: the plot is visually inspected from left to right for a distinct "elbow." Because this approach is highly subjective, it should not be considered a first choice. Although several other decision criteria exist, we use the more objective parallel analysis to decide on the number of dimensions to extract (Fabrigar et al., 1999; Horn, 1965). Fig. 9 shows the results of the parallel analysis. Parallel analysis compares the eigenvalues of the observed correlation matrix with those of random data generated to have the same number of observations and variables. Once the simulated eigenvalues exceed those of the real matrix, it can be assumed that mostly noise is being modeled and thus that irrelevant factors are being extracted. In our case, the parallel analysis suggested the extraction of two components.

FIG. 9 Results from parallel analysis scree plots (eigenvalues of principal components and factors for actual, simulated, and resampled data, plotted against factor number).
However, the decision of how many factors to extract is not always straightforward. In the case of an unclear solution, a number of additional criteria, such as the Very Simple Structure criterion (VSS; Revelle & Rocklin, 1979), the Minimum Average Partial criterion (Velicer, 1976), and, more recently, the Bass-Ackward analysis (Goldberg et al., 2006), have been developed to help identify the appropriate number of dimensions.

As suggested by the parallel analysis, we ran the PCA and extracted two components. To facilitate the interpretation of the extracted dimensions, factor solutions are commonly rotated. The rotation of factors maximizes high correlations between variables and factors and minimizes small ones. In practice, many different approaches to factor rotation exist; in general, they can be divided into oblique and orthogonal techniques. Oblique rotations are often adequate for psychological research questions because they allow factors to be correlated. Hence, we used the oblique promax rotation for the analysis. Table 6 shows that certain variables loaded higher on Component 1 and others loaded higher on Component 2. Specifically, Component 1 suggests that measurement occasions (days, in this case) characterized by more silence were also more likely to feature stationarity, conversations, and more nearby Bluetooth devices. Component 2 suggests that measurement occasions characterized by more noise were also more likely to feature walking, voice, and running inferences. These components suggest a general distinction between days spent indoors (Component 1: silence, stationarity, conversations, and other Bluetooth devices) and days spent outdoors and in transit (Component 2: noise, high physical activity, and voices).

TABLE 6 Pattern matrix for the principal component analysis.

Sensor data variable   Component 1   Component 2   h2
Silence                .96                         .89
Stationarity           .90                         .80
Conversation           .63           .34           .59
Bluetooth devices      .39                         .15
Noise                                .91           .81
Walking                              .75           .55
Voice                                .72           .58
Running                              .47           .22

Note: Sensor data variables for this analysis are all in frequencies. The analysis was performed on the basis of 65 measurement occasions (days). Loadings of all variables on the two extracted components are given; loadings below .20 were removed for reasons of clarity. Communalities (h2) show the proportion of variance in a variable explained by both components. The correlation between Components 1 and 2 was .17.

Although "conversation" also loaded on the second component, the variables "voice" and "conversation" mainly load on separate components. This counterintuitive finding could indicate that the conversation classifier was good at classifying conversations in the immediate surroundings of the participant (e.g., the participant speaking with another person) but still picked up other voices in the ambient environment (e.g., on the street, while walking). Also, the classifier used for the detection of conversations required 10 min of continuous voice input in order to label an observation as a "conversation" in the data files.

In conclusion, we have demonstrated how to fit a PCA on multivariate time-series data (P-technique). In the corresponding R-script, all steps of the analysis can be fully reproduced. This section of the tutorial for this chapter covers the classical P-technique as a PCA. Due to its inability to account for correlations across time, as well as other limitations, the classical P-technique has been extended to include changes over time (Molenaar & Nesselroade, 2009). Hence, dynamic P-technique models take autocorrelations of subsequent measures across time into account (McArdle, 1988), and the chained P-technique can examine intraindividual dynamics within groups of subjects (Lee & Little, 2012).
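For reference, the main steps of this tutorial can be sketched with the psych package, where x is a days-by-variables frequency matrix like the one constructed earlier; the chapter's accompanying R-script remains the authoritative version:

library(psych)

lowerCor(x)                        # bivariate correlations among the daily frequencies
fa.parallel(x, fa = "pc")          # parallel analysis for the number of components
pc <- principal(x, nfactors = 2, rotate = "promax")
print(pc$loadings, cutoff = .20)   # pattern matrix, hiding loadings below .20 (cf. Table 6)
pc$Phi                             # correlation between the two components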

Conclusion

There is general consensus that the gold standard for measuring personality is direct naturalistic observation paired with self-reports of psychological experiences. Such multimethod approaches are valuable because the self-reports can offer insight about the mechanisms that produce the observed behaviors (e.g., Baumeister et al., 2007; Furr, 2009). Historically, direct naturalistic observation has been time consuming and difficult to implement (Barker & Wright, 1951; Craik, 2000). However, the recent advent of digital media technologies is set to sweep aside the obstacles that have held back research in this area. In this chapter, we introduced mobile sensing as a form of ambulatory assessment for personality research. Mobile sensing promises to change how personality researchers collect data about people's thoughts, feelings, behaviors, and situational contexts. To facilitate the widespread use of mobile sensing in personality research, we described the procedural and analytical factors to consider when conducting a personality sensing study. In particular, personality dynamics research stands to benefit from the fine-grained assessments that permit examinations of within-person variability, patterns of change over time, and structural dimensions. In the coming years, the wealth of data from sensors and other digital records is set to generate new findings that inform our understanding of the dynamic processes that describe daily life. This chapter aims to encourage the adoption of such approaches by providing a resource for those interested in studying personality using mobile sensing methods.

References

Abdullah, S., Murnane, E. L., Matthews, M., Kay, M., Kientz, J. A., Gay, G., & Choudhury, T. (2016). Cognitive rhythms: Unobtrusive and continuous sensing of alertness using a mobile phone. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing, ACM. Ai, P., Liu, Y., & Zhao, X. (2019). Big Five personality traits predict daily spatial behavior: Evidence from smartphone data. Personality and Individual Differences, 147, 285–291. Back, M. D., Stopfer, J. M., Vazire, S., Gaddis, S., Schmukle, S. C., Egloff, B., & Gosling, S. D. (2010). Facebook profiles reflect actual personality, not self-idealization. Psychological Science, 21(3), 372–374. Barker, R. G., & Wright, H. F. (1951). One boy's day; a specimen record of behavior. Oxford, England: Harper. Bartlett, M. S. (1951). The effect of standardization on a χ2 approximation in factor analysis. Biometrika, 38(3/4), 337–344. Baumeister, R. F., Vohs, K. D., & Funder, D. C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2(4), 396–403. https://doi.org/10.1111/j.1745-6916.2007.00051.x. Boase, J., & Ling, R. (2013). Measuring mobile phone use: Self-report versus log data. Journal of Computer-Mediated Communication, 18(4), 508–519. Böhmer, M., Hecht, B., Schöning, J., Krüger, A., & Bauer, G. (2011, August). Falling asleep with Angry Birds, Facebook and Kindle: A large scale study on mobile application usage. In: Proceedings of the 13th international conference on human computer interaction with mobile devices and services, ACM, pp. 47–56. Boyd, R. L., & Pennebaker, J. W. (2017). Language-based personality: A new approach to personality in a digital world. Current Opinion in Behavioral Sciences, 18, 63–68. Cattell, R. B. (1952). P-technique factorization and the determination of individual dynamic structure. Journal of Clinical Psychology, 8(1), 5–10. Cerny, C. A., & Kaiser, H. F. (1977). A study of a measure of sampling adequacy for factor-analytic correlation matrices. Multivariate Behavioral Research, 12(1), 43–47.

Chen, Z., Chen, Y., Hu, L., Wang, S., Jiang, X., Ma, X., … Campbell, A. T. (2014). ContextSense: unobtrusive discovery of incremental social context using dynamic bluetooth data. In: Proceedings of the 2014 ACM international joint conference on pervasive and ubiquitous computing: Adjunct publication, ACM, pp. 23–26. https://doi.org/ 10.1145/2638728.2638801. Chen, Z., Lin, M., Chen, F., Lane, N. D., Cardone, G., Wang, R., … Campbell, A. T. (2013). Unobtrusive sleep monitoring using smartphones. In: The 2013 7th international conference on pervasive computing technologies for healthcare (pervasive health), IEEE, pp. 145–152. Cheng, J., Edwards, L. J., Maldonado-Molina, M. M., Komro, K. A., & Muller, K. E. (2010). Real longitudinal data analysis for real people: Building a good enough mixed model. Statistics in Medicine, 29(4), 504–520. https://doi.org/10.1002/sim.3775. Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2011). Who’s who with big-five: analyzing and classifying personality traits with smartphones. In: Wearable computers (ISWC), 2011 15th annual international symposium, IEEE, pp. 29–36. https://doi.org/10.1109/ISWC.2011.29. Chon, J., & Cha, H. (2011). Lifemap: A smartphone-based context provider for location-based services. IEEE Pervasive Computing, 10, 58–67. Craik, K. H. (2000). The lived day of an individual: A personenvironment perspective. In W. B. Walsh, K. H. Craik, & R. H. Price (Eds.), Person—Environment psychology: New directions and perspectives (pp. 233–266). Mahwah, NJ: Erlbaum. Csikszentmihalyi, M., & Larson, R. (2014). Validity and reliability of the experience-sampling method. In Flow and the foundations of positive psychology (pp. 35–54). Netherlands: Springer. de Montjoye, Y. A., Quoidbach, J., Robic, F., & Pentland, A. S. (2013). Predicting personality using novel mobile phonebased metrics. In Social computing, behavioral-cultural modeling and prediction (pp. 48–55). Berlin Heidelberg: Springer. https://doi.org/10.1007/978-3-642-37210-0_6. Dubey, H., Mehl, M. R., & Mankodiya, K. (2016, June). Bigear: Inferring the ambient and emotional correlates from smartphone-based acoustic big data. In: 2016 IEEE first international conference on connected health: Applications, systems and engineering technologies (CHASE), IEEE, pp. 78–83. Eagle, N., & Pentland, A. (2006). Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing, 10, 255–268. https://doi.org/10.1007/s00779-0050046-3. Eagle, N., & Pentland, A. S. (2009). Eigenbehaviors: Identifying structure in routine. Behavioral Ecology and Sociobiology, 63(7), 1057–1066. Eagle, N., Pentland, A. S., & Lazer, D. (2009). Inferring friendship network structure by using mobile phone data.
Proceedings of the National Academy of Sciences of the United States of America, 106(36), 15274–15278. https://doi.org/ 10.1073/pnas.0900282106. EUGDPR (2018). The EU general data protection regulation. https://www.eugdpr.org/. (Accessed 11 February 2018). Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272–299. https://doi.org/10.1037/1082-989X.4.3.272. Finch, W. H., Bolin, J. E., & Kelley, K. (2014). Multilevel modeling using R. Crc Press. Fleeson, W. (2001). Toward a structure-and processintegrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80(6), 1011. Fleeson, W., & Gallagher, P. (2009). The implications of Big Five standing for the distribution of trait manifestation in behavior: Fifteen experience-sampling studies and a meta-analysis. Journal of Personality and Social Psychology, 97(6), 1097. Friedman, H. S. (2000). Long-term relations of personality and health: Dynamisms, mechanisms, tropisms. Journal of Personality, 68(6), 1089–1107. Funder, D. C. (2001). Personality. Annual Review of Psychology, 52, 197–221. Funder, D. C. (2006). Towards a resolution of the personality triad: Persons, situations, and behaviors. Journal of Research in Personality, 40, 21–34. https://doi.org/ 10.1016/j.jrp.2005.08.003. Furr, R. M. (2009). Personality psychology as a truly behavioural science. European Journal of Personality, 23(5), 369–401. https://doi.org/10.1002/per.724. Goldberg, L. R., Acton, G. S., Ashton, M. C., Block, J., Boles, S., Andrade, R. D. ’., … Waller, R. Z. N. (2006). Doing it all Bass-Ackwards: The development of hierarchical factor structures from the top down. Journal of Research in Personality, 40, 347–358. https://doi.org/ 10.1016/j.jrp.2006.01.001. Gonzales, J. E., & Cunningham, C. A. (2015). The promise of pre-registration in psychological research. Psychological Science Agenda, 29(8) Retrieved from: http://www.apa. org/science/about/psa/2015/08/pre-registration.aspx. Goodwin, M. S., Velicer, W. F., & Intille, S. S. (2008). Telemetric monitoring in the behavior sciences. Behavior Research Methods, 40(1), 328–341. Gosling, S. D., & Mason, W. (2015). Internet research in psychology. Annual Review of Psychology, 66, 877–902. https://doi.org/10.1146/annurev-psych-010814-015321. Grice, J. W. (2001). Computing and evaluating factor scores. Psychological Methods, 6(4), 430–450. Grimm, K. J., Ram, N., & Estabrook, R. (2016). Growth modeling: Structural equation and multilevel modeling approaches. Guilford Publications.

Hamaker, E. L. (2012). Why researchers should think "within-person": A paradigmatic rationale. In Handbook of research methods for studying daily life (pp. 43–61). Guilford Press. Harari, G. M., Gosling, S. D., Wang, R., & Campbell, A. (2015). Capturing situational information with smartphones and mobile sensing methods. European Journal of Personality, 29, 509–511. https://doi.org/10.1002/per.2032. Harari, G. M., Lane, N., Wang, R., Crosier, B., Campbell, A. T., & Gosling, S. D. (2016). Using smartphones to collect behavioral data in psychological science: Opportunities, practical considerations, and challenges. Perspectives on Psychological Science, 11, 838–854. Harari, G. M., Müller, S. R., Aung, M. S., & Rentfrow, J. P. (2017). Smartphone sensing methods for studying behavior in everyday life. Current Opinion in Behavioral Sciences, 18, 83–90. Harari, G. M., Müller, S. R., & Gosling, S. D. (2018). Naturalistic assessment of situations using mobile sensing methods. In Oxford handbook of psychological situations. Oxford University Press. Harari, G. M., Müller, S. R., Stachl, C., Wang, R., Wang, W., Bühner, M., … Gosling, S. D. (2019). Sensing sociability: Individual differences in young adults' conversation, calling, texting, and app use behaviors in daily life. Journal of Personality and Social Psychology, 119(1), 204–228. Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., & Jennions, M. D. (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13(3), e1002106. Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185. Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–217. Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5805. Lane, N. D., Miluzzo, E., Lu, H., Peebles, D., Choudhury, T., & Campbell, A. T. (2010). A survey of mobile phone sensing. IEEE Communications Magazine, 48(9), 140–150. https://doi.org/10.1109/MCOM.2010.5560598. Lee, I. A., & Little, T. D. (2012). P-technique factor analysis. In Handbook of developmental research methods (pp. 350–362). The Guilford Press. Lu, H., Frauendorfer, D., Rabbi, M., Mast, M. S., Chittaranjan, G. T., Campbell, A. T., … Choudhury, T. (2012). Stresssense: Detecting stress in unconstrained acoustic environments using smartphones. In: Proceedings of the 2012 ACM conference on ubiquitous computing, ACM, pp. 351–360. Lu, H., Pan, W., Lane, N. D., Choudhury, T., & Campbell, A. T. (2009, June). SoundSense: Scalable sound sensing for
people-centric applications on mobile phones. In: Proceedings of the 7th international conference on mobile systems, applications, and services, ACM, pp. 165–178. McArdle, J. J. (1988). Dynamic but structural equation modeling of repeated measures data. In Handbook of multivariate experimental psychology (pp. 561–614). Boston, MA: Springer. McArthur, L. H., & Raedeke, T. D. (2009). Race and sex differences in college student physical activity correlates. American Journal of Health Behavior, 33(1), 80–90. Mehl, M. R., Pennebaker, J. W., Crow, D. M., Dabbs, J., & Price, J. H. (2001). The Electronically Activated Recorder (EAR): A device for sampling naturalistic daily activities and conversations. Behavior Research Methods, Instruments, & Computers, 33(4), 517–523. Mehl, M. R., & Robbins, M. L. (2012). Naturalistic observation sampling: The electronically activated recorder (EAR). In The handbook of research methods for studying daily life (pp. 176–192). Guilford Press. Mehrotra, A., Hendley, R., & Musolesi, M. (2016, September). PrefMiner: Mining user's preferences for intelligent mobile notification management. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing, ACM, pp. 1223–1234. Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science, 7, 221–237. https://doi.org/10.1177/1745691612441215. Miluzzo, E., Lane, N. D., Fodor, K., Peterson, R., Lu, H., Musolesi, M., … Campbell, A. T. (2008). Sensing meets mobile social networks: The design, implementation and evaluation of the CenceMe application. In: Proceedings of the 6th ACM conference on embedded network sensor systems, ACM, pp. 337–350. https://doi.org/10.1145/1460412.1460445. Molenaar, P. C., & Campbell, C. G. (2009). The new person-specific paradigm in psychology. Current Directions in Psychological Science, 18(2), 112–117. Molenaar, P. C. M., & Nesselroade, J. R. (2009). The recoverability of P-technique factor analysis. Multivariate Behavioral Research, 44(1), 130–141. https://doi.org/10.1080/00273170802620204. Mønsted, B., Mollgaard, A., & Mathiesen, J. (2018). Phone-based metric as a predictor for basic personality traits. Journal of Research in Personality, 74, 16–22. https://doi.org/10.1016/J.JRP.2017.12.004. Montag, C., Błaszkiewicz, K., Sariyska, R., Lachmann, B., Andone, I., Trendafilov, B., … Markowetz, A. (2015). Smartphone usage in the 21st century: Who is active on WhatsApp? BMC Research Notes, 8(1), 331. Müller, S. R., Harari, G. M., Mehrotra, A., Matz, S., Khambatta, P., Musolesi, M., … Rentfrow, J. P. (2017). Using human raters to characterize the psychological characteristics of GPS-based places. In: Proceedings of the
ACM on interactive, multimedia, wearable and ubiquitous technologies (IMWUT). Murnane, E. L., Abdullah, S., Matthews, M., Kay, M., Kientz, J. A., Choudhury, T., … Cosley, D. (2016). Mobile manifestations of alertness: Connecting biological rhythms with patterns of smartphone app use. In: Proceedings of the 18th international conference on human-computer interaction with mobile devices and services, ACM. Ozer, D. J., & Benet-Martinez, V. (2006). Personality and the prediction of consequential outcomes. Annual Review of Psychology, 57, 401–421. Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., … Seligman, M. E. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934. Paulhus, D. L., & Vazire, S. (2007). The self-report method. In Handbook of research methods in personality psychology (pp. 224–239). Guilford Press. Rabbi, M., Ali, S., Choudhury, T., & Berke, E. (2011). Passive and in-situ assessment of mental and physical well-being using mobile sensors. In: Proceedings of the 13th international conference on ubiquitous computing (pp. 385–394). New York, NY: ACM. https://doi.org/10.1145/2030112.2030164. Rachuri, K. K., Musolesi, M., Mascolo, C., Rentfrow, P. J., Longworth, C., & Aucinas, A. (2010, September). EmotionSense: A mobile phones based adaptive platform for experimental social psychology research. In: Proceedings of the 12th ACM international conference on ubiquitous computing, ACM, pp. 281–290. Rauthmann, J. F., Sherman, R. A., & Funder, D. C. (2015). Principles of situation research: Towards a better understanding of psychological situations. European Journal of Personality, 29(3), 363–381. Revelle, W., & Rocklin, T. (1979). Very simple structure: An alternative procedure for estimating the optimal number of interpretable factors. Multivariate Behavioral Research, 14, 403–414. Roberts, B. W., Kuncel, N. R., Shiner, R., Caspi, A., & Goldberg, L. R. (2007). The power of personality: The comparative validity of personality traits, socioeconomic status, and cognitive ability for predicting important life outcomes. Perspectives on Psychological Science, 2(4), 313–345. Saeb, S., Lattie, E. G., Schueller, S. M., Kording, K. P., & Mohr, D. C. (2016). The relationship between mobile phone location sensor data and depressive symptom severity. PeerJ, 4, e2537. Saeb, S., Zhang, M., Karr, C. J., Schueller, S. M., Corden, M. E., Kording, K. P., & Mohr, D. C. (2015). Mobile phone sensor correlates of depressive symptom severity in daily-life behavior: An exploratory study. Journal of Medical Internet Research, 17, e175. https://doi.org/10.2196/jmir.4273. Schoedel, R., Au, Q., Völkel, S. T., Lehmann, F., Becker, D., Bühner, M., … Stachl, C. (2018). Digital footprints of
sensation seeking. Zeitschrift für Psychologie, 226(4), 232–245. https://doi.org/10.1027/2151-2604/a000342. Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., Agrawal, M., … Ungar, L. H. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS One, 8(9), e73791. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. Smith, T. W. (2006). Personality as risk and resilience in physical health. Current Directions in Psychological Science, 15(5), 227–231. Soto, C. J. (2019). How replicable are links between personality traits and consequential life outcomes? The life outcomes of personality replication project. Psychological Science, 30(5), 711–727. https://doi.org/10.1177/0956797619831612. Stachl, C., Hilbert, S., Au, J. Q., Buschek, D., De Luca, A., Bischl, B., … Bühner, M. (2017). Personality traits predict smartphone usage. European Journal of Personality, 31(6), 701–722. Stachl, C., Schoedel, R., Au, Q., Völkel, S., Buschek, D., Hussmann, H., … Bühner, M. (2017). The PhoneStudy project. November 10. https://doi.org/10.17605/OSF.IO/UT42Y. Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Pearson Education. Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54. Trull, T. J., & Ebner-Priemer, U. (2013). Ambulatory assessment. Annual Review of Clinical Psychology, 9, 151–176. Tsai, L., & Li, S. (2004). Sleep patterns in college students: Gender and grade differences. Journal of Psychosomatic Research, 56, 231–237. Tseng, V. W., Merrill, M., Wittleder, F., Abdullah, S., Aung, M. H., & Choudhury, T. (2016). Assessing mental health issues on college campuses: Preliminary findings from a pilot study. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing: Adjunct, ACM, pp. 1200–1208. Vazire, S., & Gosling, S. D. (2004). E-perceptions: Personality impressions based on personal websites. Journal of Personality and Social Psychology, 87(1), 123. Velicer, W. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41, 321–327. Wang, R., Chen, F., Chen, Z., Li, T., Harari, G., Tignor, S., … Campbell, A. T. (2014). StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. In: Proceedings of
Wang, R., Harari, G., Hao, P., Zhou, X., & Campbell, A. T. (2015, September). SmartGPA: How smartphones can assess and predict academic performance of college students. In: Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing, ACM, pp. 295–306. Wang, W., Harari, G. M., Wang, R., M€ uller, S. R., Mirjafari, S., Masaba, K., & Campbell, A. T. (2018). Sensing behavioral change over time: Using within-person variability features from mobile sensing to predict personality traits. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3), 1–21.

Wrzus, C., & Mehl, M. R. (2015). Lab and/or field? Measuring personality processes and their social consequences. European Journal of Personality, 29(2), 250–271. Yarkoni, T. (2010). Personality in 100,000 words: A largescale analysis of personality and word use among bloggers. Journal of Research in Personality, 44(3), 363–373. Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computerbased personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences, 112(4), 1036–1040.


C H A P T E R

30

Experience sampling and daily diary studies: Basic concepts, designs, and challenges

Kai T. Horstmann

Institute of Psychology, Humboldt-Universität zu Berlin, Berlin, Germany

O U T L I N E

Background of ESM studies
Research questions for experience sampling studies
Approaching ESM from the perspective of test construction
Designing an experience sampling study
  Basic setup
  Detailed setup
Preregistration of experience sampling studies
Pilot testing an experience sampling study
Platforms for experience sampling studies
Summary
References

Abstract

Experience sampling and daily diary methods have become increasingly popular among psychologists. The repeated assessment of persons in their daily lives allows capturing how a person feels, thinks, or behaves, or what he or she desires, in the very moment. The current chapter describes basic concepts of experience sampling studies and gives an overview of possible designs and challenges that may be encountered when setting up a first experience sampling study. To overcome these challenges, three basic questions can be answered: (A) What is the construct being measured? (B) What is the purpose of the measure? (C) What is the targeted population of persons and situations? Finally, practical advice is given on how to think through and pilot test an experience sampling study before data collection begins.

Keywords: Experience sampling, Daily diary, Study design, Preregistration, Simulation

Tests of modern conceptualizations of personality require the repeated examination of persons' thoughts, feelings, desires, and behaviors in their daily lives (e.g., Baumert et al., 2017; Jayawickreme, Zachry, & Fleeson, 2019; Wrzus & Roberts, 2017). If a theory claims, for example, that personality traits can be conceptualized as density distributions of personality states (Fleeson & Jayawickreme, 2015), then testing this theory requires assessing the distribution of personality states. This can, for example, be achieved through experience sampling, where the same person reports on their current states several times a day. From a technical perspective, it is now comparatively simple to sample repeated (self-)assessments and experiences of a person in their daily life. Similarly, a number of recent and forthcoming publications underline the widespread application of the experience sampling method (ESM) in personality research (Horstmann & Rauthmann, in preparation; Horstmann, Rauthmann, Sherman, & Ziegler, 2020; Horstmann & Ziegler, 2020; Quintus, Egloff, & Wrzus, under review; Sherman, Rauthmann, Brown, Serfass, & Jones, 2015). The use of ESM and daily diary studies as a means to examine psychological research questions will most likely increase. To support the examination of person-situation dynamics, I provide a starting point for conducting experience sampling and daily diary studies. Although many comparable and valuable resources exist, the goal of the current chapter is to structure the process of setting up an ESM study. This chapter is tailored toward someone who wants to conduct their first ESM study: it provides a walk-through from thinking about research questions to setting up a study, as well as testing and possibly preregistering it. However, there are of course many additional resources to consult. Shiffman, Stone, and Hufford (2008) as well as Scollon and Kim-Prieto (2003) give a very broad overview of the method and its different applications. Wrzus and Mehl (2015) also give a broad overview of ESM and contrast it with observations obtained in the laboratory. Hofmans, De Clercq, Kuppens, Verbeke, and Widiger (2019) outline the whole process of conducting an ESM study, including setup and data analysis. Finally, some publications focus on narrower issues, such as the optimal ESM schedule (i.e., when participants receive their daily invitations; van Berkel et al., 2019), the sampling frequency and length of the survey (Eisele et al., 2020), the instructions to be used (Stone, Wen, Schneider, & Junghaenel, 2020), the estimation of reliability (Nezlek, 2017; Schönbrodt, Zygar, Nestler, Pusch, & Hagemeyer, submitted), validity (Vogelsmeier, Vermunt, van Roekel, & De Roover, 2019), or the construction of state measures (Horstmann & Ziegler, 2020). However, not many focus on the very practical issues of setting up an ESM study. Although the current chapter does not make any general recommendations, it provides questions and issues to think about when setting up a study geared toward understanding person-situation dynamics.

Background of ESM studies

ESM and daily diary methods describe the repeated assessment of a person's momentary surroundings, situations, or experiences (Hektner, Schmidt, & Csikszentmihalyi, 2007; Wrzus & Mehl, 2015). Both may be subsumed under the general method of ecological momentary assessment (EMA), which also includes ambulatory assessment (AA). Whereas AA may employ assessment methods other than self-report (e.g., audio snippets, pictures, video recordings), ESM mostly refers to repeated self-assessments of a person's current experiences. That being said, the use of terminology varies and is not always consistent across labs, disciplines, or countries. Typically, surveys in ESM studies are taken several times a day (e.g., every 3 h), while daily diary studies entail one survey per day (mostly at the end of the day). Each assessment or survey should be brief enough to be taken unobtrusively during daily life (for example, about 2–3 min to complete a single survey).

ESM studies have been conducted for more than 100 years. One of the earliest references dates back to Bevans (1913), who investigated how "working men spent their time." Another, more elaborate study by Flügel (1925) investigated "feeling and emotion in everyday life."a The study by Flügel already mentions several issues that still trouble current users of ESM, such as the estimation of the reliability of the obtained scores (Nezlek, 2017; Schönbrodt et al., submitted) or response styles that may distort participants' records. Over the last 100 years, and especially in recent years, sampling participants' momentary experiences instead of assessing solely their general characteristics has become more and more popular (Scollon & Kim-Prieto, 2003). Furthermore, recent publications do not only examine substantive questions using ESM (e.g., Rogers & Biesanz, 2019; Sun & Vazire, 2019), but also investigate the psychometric properties of scores obtained during ESM (see Horstmann & Ziegler, 2020, for an overview) or combine data from several studies to investigate the role of methodological factors (e.g., Horstmann & Rauthmann, in preparation; Podsakoff, Spoelma, Chawla, & Gabriel, 2019). Additionally, ESM studies that rely on participants' smartphones can simultaneously take advantage of the smartphones' sensors to collect additional informative data (Buschek et al., 2018; Chittaranjan, Blom, & Gatica-Perez, 2013; Harari et al., 2019; Schoedel et al., 2018; Stachl et al., 2017), thereby further strengthening the usefulness and value of ESM for investigating psychological processes. It is apparent, at this stage, that the ability to understand and conduct ESM studies will become more and more important in psychological research. However, for which research questions is an ESM study appropriate?

a The study is generally worth reading and mentions several issues and problems that need to be overcome when conducting or planning an experience sampling study.

Research questions for experience sampling studies

Generally speaking, ESM studies can be used to obtain information at two different levels. First, and as the method was initially designed for this purpose, ESM can be conducted to gain insights into within-unit processes. Here, "unit" would usually mean a person, but could also refer to other units that develop over time, such as a group of persons, a friendship dyad, and so on. For example, it would be possible to assess participants' current mood over the course of a week and see how their mood changes in reaction to certain events (such as getting up, being rejected, meeting a friend, and so on). Second, scores obtained during ESM can be aggregated and then used as information at the between-unit level. For example, the scores of participants' moods can be used to describe each person's average level of mood. As in the first case, "unit" may refer to more than just persons, for example, to groups of persons that are repeatedly assessed. Here, the average across all assessments of a group would be a characteristic of the group or the dyad. Furthermore, there may be multiple levels, such as momentary assessments nested in (i.e., grouped in) days, nested in weeks, nested in persons, nested in cities, etc. For each of these, ESM can provide information at the within-unit level or at the between-unit level.

For the description of person-situation dynamics, ESM is most commonly applied to obtain information about multiple experiences of persons in situations (within-person) and about person characteristics (between-person; see Horstmann & Ziegler, 2020, for an overview). For the remainder of the chapter, I will therefore assume two levels, namely, persons (also called the between-person level, level 2, or person-to-person variation) and persons in situations (also called the within-person level, level 1, or situation-to-situation variation), although the line of reasoning can readily be applied to any number of levels.
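To make the two levels concrete, the following is a minimal sketch in Python of how momentary scores are aggregated to the between-person level and person-mean centered for within-person analyses. The data frame and column names are invented for illustration, not taken from any particular study.

```python
import pandas as pd

# Hypothetical long-format ESM data: one row per person x measurement occasion.
esm = pd.DataFrame({
    "person_id": [1, 1, 1, 2, 2, 2],
    "occasion":  [1, 2, 3, 1, 2, 3],
    "mood":      [3.0, 4.0, 2.0, 4.5, 4.0, 5.0],
})

# Between-person level: aggregate occasions to one score per person.
person_means = esm.groupby("person_id")["mood"].mean()

# Within-person level: deviations of each occasion from the person's own mean
# (person-mean centering), the usual input for within-person analyses.
esm["mood_within"] = esm["mood"] - esm["person_id"].map(person_means)

print(person_means)
print(esm)
```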

Approaching ESM from the perspective of test construction

Planning and conducting an ESM study is, at least in certain aspects, similar to constructing a test. Ziegler (2014) argued that, before a test is constructed, three basic questions need to be answered: (1) What is the construct being measured? (2) What is the purpose of the measure? (3) What is the targeted population? This may similarly be applied to ESM studies. Conducting an ESM study will eventually result in test scores that are obtained to investigate a certain construct under specific circumstances (i.e., during participants' normal lives) from a specific population in order to examine a certain research question. When planning an ESM study, answers to the three questions should guide the design of the study. This will help in designing a study that is, on the one hand, able to provide sufficient answers to the research questions asked, while being, on the other hand, not too burdensome for the participants.

What is the construct being measured?

The first question that should be answered concerns the construct that is being assessed, such as "extraversion," "intelligence," or "mental speed." However, before the specific construct (e.g., extraversion) is defined, one should first consider the different types of constructs that one is interested in (e.g., personality).

ESM studies for the examination of person-situation dynamics are frequently used to assess, among other things, personality states (Horstmann & Rauthmann, in preparation). The most recent definition of personality states conceptualizes them as "[q]uantitative dimension[s] describing the degree/extent/level of coherent behaviours, thoughts, and feelings at a particular time" (Baumert et al., 2017, p. 528). Although one may technically debate whether the "and" linking the state categories in this definition has to be understood as an "and" or rather as an "or" (Horstmann & Ziegler, 2020), it is indisputably nearly impossible to imagine items that capture all three of those categories at once. Additionally, although referring to personality traits, Wilt and Revelle (2015) pointed out that personality traits may be composed of Affect, Behavior, Cognition, and Desire (ABCD). Applying this scheme to personality states, motivation (e.g., current desires, needs, etc.) may thus be added as another viable state category. Further, and of great importance to the assessment of person-situation dynamics (Colom, Bensch, Horstmann, Wehner, & Ziegler, 2019), it is also possible to assess specific (cognitive) abilities with a smartphone (Steger, Schroeders, & Wilhelm, 2019). Having large pools of parallel items would technically allow assessing momentary cognitive ability, for example, fluid intelligence, during ESM as well. However, to the best of my knowledge, no one has yet conducted such a study. Although some may argue that abilities should not vary from moment to moment (requiring a test of maximum performance), research has shown that ability test scores can be influenced, for example, by motivational factors (e.g., Duckworth, Quinn, Lynam, Loeber, & Stouthamer-Loeber, 2011). Given such within-person variation in cognitive performance, this phenomenon may also be relevant for future ESM studies.


Finally, it is also possible to ask participants not only about their inner states, but also about their (external) current situation (Horstmann, Rauthmann, & Sherman, 2018; Horstmann & Ziegler, 2019; Rauthmann, Horstmann, & Sherman, 2020; Sherman et al., 2015; Ziegler, Horstmann, & Ziegler, 2019). During ESM, a situation can be assessed at three different levels: situational cues (physical elements that are present), situation characteristics (psychologically relevant aspects of the situation), or situation classes (situations that share a common set of cues or characteristics) (Rauthmann, Sherman, & Funder, 2015). Of course, asking participants about their current situation will also be confounded with their current emotional or affective states, their focus, or their attention to detail (Horstmann et al., 2020; Horstmann & Ziegler, 2019). Even if participants were in the "same" situation (e.g., in the same room or the same lecture), they would not necessarily report the same cues, characteristics, or even situation classes. A rating by any participant will therefore be somewhat subjective and confounded with their momentary experience; hence the term experience sampling.

Table 1 provides examples for the assessment of different phenomena (i.e., affect/feelings/emotions, behavior, cognition, desire, situations, abilities). The list of phenomena that can be assessed is not exhaustive; for example, situational interests (Roemer, Horstmann, & Ziegler, 2020) or morality (Hofmann et al., 2014) also come to mind, yet these may be subsumed under affect or cognition (interests) and behavior (morality). Although the constructs and items listed in Table 1 may likewise not be exhaustive, they provide a starting point for selecting and classifying the items and constructs to be assessed. In some cases, assigning an item to a specific type of construct is not straightforward. For example, the item "In the past 3 h I thought about how to make sure I did not mislead somebody" (Hofmann et al., 2014) has been classified as behavior, although it clearly assesses a cognition (i.e., thoughts), albeit one that is about one's own behavior. This need not be problematic, but it has to be considered when interpreting correlates of such measures: correlations with other constructs that primarily capture thoughts may be inflated, whereas correlations with "real" behaviors (i.e., directly observable motions or expressions) may be reduced.

What is the purpose of the measure?

As outlined, measures obtained during ESM can serve at least two broad purposes. The first is to obtain general, time-invariant information on the person. The second is to obtain information on the momentary experience of a person, that is, the ins and outs of their daily life. The research question (i.e., whether it focuses on the between- or the within-person level of analysis) influences the design of the ESM study in substantial ways.

Research questions at the between-person level

Consider, for example, Whole Trait Theory (Fleeson, 2001; Fleeson & Jayawickreme, 2015; Jayawickreme et al., 2019). One aspect of Whole Trait Theory (the "descriptive" part) postulates that the average of daily experiences (e.g., operationalized as the average of a person's daily experiences over a longer period of time) should correspond to a person's trait level. Simply put, if a person has a trait score of "3" on an extraversion scale, the average of their daily manifestations of extraversion should also be 3.b For example, Finnigan and Vazire (2018) examined whether the average of self-reported behavior states predicted a relevant outcome (here, an observer report on the same trait) above and beyond what is already predicted by the global self-report of the trait.

b Assuming that traits and trait manifestations are assessed on the same response scale (e.g., a Likert-type scale from 1 to 5) and that the scale points have the same meaning for traits and states (commensurability).

TABLE 1 Different phenomena, constructs, and example items that can be assessed during experience sampling.

Construct type | Example of construct | Example item | References

Internal (persons)
Affect/emotions | Positive affect | How do you feel during the current situation: cheerful?a | Hampel (1977) and Horstmann et al. (2020)
Affect/emotions | Happiness | How do you feel right now: sad-happy?a | Sherman et al. (2015)
Behavior | Extraverted behavior | Which of the two terms is better suited for describing your behavior during the previous hour: responsible or irresponsible?a | Bleidorn (2009)
Behavior | Moral behavior | In the past 3 h I thought about how to make sure I did not mislead somebodya | Meindl, Jayawickreme, Furr, and Fleeson (2015)
Cognition | Sense of purpose | Do you think that your life has a clear sense of purpose at the moment?a | Hofmann, Wisneski, Brandt, and Skitka (2014)
Desire | Everyday desires | Did you experience a desire in the last 30 min? If so, please choose which one: eating, alcohol, etc.b | Hofmann, Vohs, and Baumeister (2012)
Desire | Extra-pair desire | I had sexual fantasies about men other than my partnera | Arslan, Schilling, Gerlach, and Penke (2018) and Haselton and Gangestad (2006)
Cognitive abilities | Crystallized intelligence | Which of the following terms describes a measure of dispersion? (a) Quartile, (b) Median, (c) Mode, (d) Standard Deviationc | Steger et al. (2019)

External (situations)
Cues | Objects | – | –
Cues | Persons | Who have you been interacting with in the last 30 min? | Human, Mignault, Biesanz, and Rogers (2019)
Characteristics | Duty | A task needs to be donea | Rauthmann and Sherman (2016)
Classes | Social contexts | In which situation did the interaction take place? uni, meeting, celebration, etc.b | Geukes, Nestler, Hutteman, Küfner, and Back (2017)

a The response is given on a rating scale where participants agree to the statement or, in the case of bipolar rating scales, indicate where between the two poles they currently identify.
b The response is given by selecting multiple options from a list.
c The response is given by selecting one correct answer from a list.

For such a research question, it is important that the assessed behavior reflects the content of the global self- or other-report. Assume that a trait measure of extraversion contains three facets: Sociability, Assertiveness, and Energy Level (Soto & John, 2017). Now, if one were to test the assumption that the average extraverted behavior corresponds to the three facets, what should the corresponding ESM measure of extraverted behavior ideally look like?


Choosing the right number of items

Let us assume we take the structural model that Soto and John (2017) suggested for extraversion. This model consists of three latent variables (the facets), each assessed by four items. The facets correlate at 0.54, 0.48, and 0.59 in the internet validation sample presented by Soto and John (2017). For illustrative purposes, I assume that all items have the same loading on their latent variable (fixed to 1). Extraversion as a trait is thus assessed with 12 items, and each of its facets with 4 items. We can assume that each of the 12 items describes a manifestation of the personality trait that can be assessed in daily life (e.g., an affect, a behavior, a cognition, or a desire). For several reasons, it may be impossible to administer all 12 items at each measurement occasion, and the number of items must therefore be reduced. The question now is: how should these items be combined to best capture the content of the personality trait and its manifestation in daily life, that is, the extraversion state?

To show how different choices in study design can influence the outcome, it is possible to simulate the outcome of different sampling plans. Here, we assume that all items are equally representative of the construct (which most likely is not the case). As we are interested in a person characteristic (i.e., the average across all assessments) and not in the momentary experience of the person, we may choose a planned missingness design (Arslan, Reitz, Driebe, Gerlach, & Penke, 2020; Silvia, Kwapil, Walsh, & Myin-Germeys, 2014). There are many different ways in which a planned missingness design for the assessment of momentary behaviors can be implemented. Table 2 gives an overview of six different designs: First, all 12 items can be assessed at each measurement occasion ("full" condition). Second, one item per facet can be sampled at random at each measurement occasion ("one random per facet" condition). Third, per assessment, one item is fixed (always the same item) and two items are assessed at random ("anchor" condition). Fourth, at each assessment, the same item per facet is sampled ("one fixed per facet" condition). Fifth, one of the 12 items can be assessed at random ("one random" condition). Sixth, the same single item can be assessed at each measurement occasion ("one fixed" condition). Each design has specific advantages and disadvantages, described in Table 2.

How does each sampling procedure play out with respect to the hypothesis that the average extraverted behavior corresponds to the trait? I will address this question with a simulation: We assume that each extraversion state assessed by the items is displayed by each person with a standard deviation of SD_beh = 5 (on an item ranging from 1 to 5). Assume a person has a true score of 2 on the first item. With SD_beh = 5, we would assume that their states are normally distributed, s ~ N(2, 25) (i.e., a mean of 2 and a standard deviation of 5, or a variance of 25).c If this is simulated, we can sample states from these distributions (i.e., pretend we would measure them) according to the different planned missingness designs. The sampled states can then be averaged and correlated with the trait score. Here, the trait score for a person is represented by the average of their item true scores. The higher the correlation between the averaged state and the trait, the more successful the design is in capturing the trait's content.

Fig. 1 shows the results of the different sampling procedures. It can be seen that the scores obtained under the "full" sampling procedure result in the highest correlation between traits and average behaviors. The two random procedures that sample three items per measurement occasion do not differ very much in their overall magnitude.

c The size of the standard deviation is arbitrarily selected for illustrative purposes only.

TABLE 2 Six different questionnaire design strategies for experience sampling.

Design | Description | # of items | Advantages/disadvantages
1. Full | Sample all 12 items at each measurement occasion | 12 | + Breadth of construct fully captured; − many items; − participant burden
2. One random per facet | Sample a different, randomly selected item from each facet at each occasion | 3 | + All facets represented; + fewer items; + average burden; + different items
3. Anchor | One item fixed, two items at random at each occasion | 3 | + All facets represented; + fewer items; + average burden; − different items, but one item always the same
4. One fixed per facet | Sample the same item from each facet at each occasion | 3 | + All facets represented; + fewer items; + average burden; − the same items each occasion
5. One random | Sample one random item (from any facet) at each occasion | 1 | + Low participant burden; − narrow representation of construct
6. One fixed | Sample the same item from one facet at each occasion | 1 | + Lowest participant burden; − very narrow representation of construct

Note. # of items = number of items at each assessment; + indicates a benefit of the design; − indicates a potential deficit.

FIG. 1 The simulated correlation between a trait report (e.g., extraversion) and an average state report, depending on the number of items sampled at each measurement occasion. Note: For a description of the individual lines, see Table 2.


The anchor technique, in which one item is always the same and two items are sampled at random, generates the second-highest correlation. Note that in the anchor design, the item that is sampled at each occasion should be the one most representative of the construct. The sampling procedure "one random per facet" nevertheless achieves a higher correlation between average state and trait; this must necessarily be so, given that this procedure captures a broader construct space than the "one fixed per facet" procedure. Finally, the "one random" procedure relies on a single item per measurement occasion, leading to the smallest correlation between trait and average behavior. Note that the measurement models have assumed that all items load equally on their respective facets. This is, of course, not a realistic assumption. If items are heterogeneous and thus have different loadings, the difference between the second and third strategies must necessarily be larger. It is additionally noteworthy that for research questions focusing on the convergence of an average state with any outcome (see, e.g., Augustine & Larsen, 2012; Finnigan & Vazire, 2018; Rauthmann, Horstmann, & Sherman, 2019), not many measurement occasions are required (certainly not more than 20), a result that is consistent with prior findings (Augustine & Larsen, 2012; Epstein, 1980; Horstmann & Rauthmann, in preparation).

To summarize, this small simulation showed that study design choices, such as the number of selected items, must be informed by the research question and the purpose of the measure. However, the simulation has also glossed over several assumptions that may not hold in practice. We have assumed that each item is applicable at every measurement occasion, that participants do not have any missing data, and that it is possible to sample behavior from a normal distribution. The latter is a very strong assumption: behavior in everyday life need not be normally distributed. This point will be taken up later in the discussion of different sampling procedures.
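For readers who want to retrace the logic of this simulation, the following Python sketch re-implements it under simplified assumptions. It is a minimal illustration, not the exact simulation reported here: it assumes uncorrelated, uniformly distributed item true scores instead of the correlated three-facet measurement model, and the design labels are ad hoc names.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_persons, n_occasions, sd_beh = 500, 50, 5.0
n_items = 12
facets = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # 3 facets x 4 items

# True item scores per person; the trait is their average (cf. the text).
true_scores = rng.uniform(1, 5, size=(n_persons, n_items))
trait = true_scores.mean(axis=1)

def items_for(design):
    """Item indices administered at one occasion under each design in Table 2."""
    if design == "full":
        return np.arange(n_items)
    if design == "one_random_per_facet":
        return np.array([rng.choice(f) for f in facets])
    if design == "anchor":  # item 0 fixed, two other items at random
        return np.concatenate(([0], rng.choice(np.arange(1, n_items), 2, replace=False)))
    if design == "one_fixed_per_facet":
        return np.array([f[0] for f in facets])
    if design == "one_random":
        return np.array([rng.integers(n_items)])
    return np.array([0])  # "one_fixed"

for design in ("full", "one_random_per_facet", "anchor",
               "one_fixed_per_facet", "one_random", "one_fixed"):
    avg_state = np.empty(n_persons)
    for p in range(n_persons):
        # Momentary states are drawn around the true score, s ~ N(true, sd_beh^2).
        states = [rng.normal(true_scores[p, i], sd_beh)
                  for _ in range(n_occasions) for i in items_for(design)]
        avg_state[p] = np.mean(states)
    print(f"{design:>20}: r(trait, avg state) = {np.corrcoef(trait, avg_state)[0, 1]:.2f}")
```

As in Fig. 1, the "full" design should yield the highest trait-state correlation and the single-item designs the lowest, with the gap shrinking as the number of occasions grows.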


Research questions at the within-person level

Research questions at the within-person level may be concerned with (a) the development of person characteristics, (b) the explanation of variance in state measures (e.g., Rauthmann, Jones, & Sherman, 2016; Sherman et al., 2015), or (c) the explanation of sequences of events and experiences (e.g., Arslan et al., 2018). ESM is mostly not used to address the first question: the development of person characteristics requires the repeated assessment of those characteristics at the person level, and such questions are nowadays usually answered with existing data from large panel studies rather than with an ESM design. ESM designs are especially suited to answer the latter two questions. The important difference between these two questions is the aspect of time, which again influences the study design substantially.

If the sequence of measurement occasions is irrelevant, the time of the assessment need not be considered further. In this case, one would be mainly interested in questions such as "Does perceiving a moral event co-occur with greater happiness at the same moment?" (Hofmann et al., 2014) or "Is perceiving a situation as adverse associated with less agreeable behavior in the same situation?" (Sherman et al., 2015). Here, the purpose of the measure is mainly to describe the momentary experience of a person, and the momentary experience at Time 1 is considered independent of the experience at Time 2 (i.e., the focus of the study lies purely on cross-sectional or contemporaneous associations, as opposed to autoregressive or cross-lagged effects). Concerning the design of the ESM study, it is then not even necessary to record the time of each measurement (e.g., for reasons of privacy) or to include so-called access windows (i.e., participants can only take part within a certain time after they were invited).


Alternatively, the purpose of the study can be the description of sequences and the development of events and experiences over time. Questions that may be examined in this context are, for example, "Does perceiving a situation as adverse lead to disagreeable behavior at the next measurement occasion?" (Rauthmann et al., 2016). It is a commonly held belief that, in this case, time intervals should be kept constant between as well as within persons: "between persons" means that all persons have the same fixed schedule of participating, and "within persons" that a person takes part, for example, every 3 h. Achieving both in ESM studies is nearly impossible. However, as Voelkle and Oud (2013) point out, the belief that time intervals need to be equal within and between participants stems from three assumptions: that equal intervals are required for comparing parameters, for detecting the data-generating process, and for the statistical analyses. According to Voelkle and Oud (2013), these arguments "are all wrong" (p. 104). In some cases, the statistical model employed does not require constant intervals or may even benefit from unevenly spaced intervals (Voelkle & Oud, 2013). Before setting up an ESM study, one should therefore think about the research question and the appropriate statistical technique, and about whether it requires evenly spaced intervals or not. If evenly spaced intervals are nevertheless required, two solutions are possible: First, the study may be set up so that participants can only participate during specific periods (i.e., access windows). However, if participants miss these windows frequently, this may lead to frustration and therefore attrition in the sample. Alternatively, measurement occasions can be excluded after they have been collected so that the remaining measurement occasions have equally spaced time intervals. This approach is usually preferable, as the data can still be analyzed for questions that do not involve a sequential analysis.d

d Note that these considerations are independent of the design (interval-contingent or event-contingent; see later in this chapter for a discussion), as both evenly and unevenly spaced intervals can be obtained in interval-contingent as well as event-contingent designs (depending, of course, on the kind of event that is examined).

What is the targeted population?

In ESM, two populations are targeted: the population of persons (participants) and the population of situations (participants' environments). Both can be selected, controlled, and influenced (sometimes even unintentionally).

Population of persons

Most people nowadays have a smartphone, and most smartphones have some form of internet access. To illustrate, in 2018, 77.5% of households in Germany had at least one smartphone, and 96.7% had some form of mobile phone (Statistisches Bundesamt, 2018). Although this means that, at least in Germany, some parts of the population may not yet be ready to participate with their own smartphone, it may still be possible to lend smartphones or other devices (e.g., tablets) to participants. The availability and accessibility of mobile devices for the purpose of data collection is therefore not the limiting factor. Further, even if mobile internet access is not available, data can be collected via apps that run without an internet connection.

When approaching the population of persons from the perspective of test construction, it is important to consider whether the items and the study design are "the right fit" for the participants. As with any study, this concerns the item difficulty or the wording of the items (Rammstedt & Kemper, 2011; Ziegler, 2014). For example, some items may only apply to mothers or partners, and other questions are only relevant during certain periods of a person's life (e.g., caring for children) or during specific experiences throughout the day (e.g., for those that commute) (Horstmann, Ziegler, & Ziegler, 2018). The nature of any ESM study is that participants are repeatedly contacted, and it is therefore possible to adapt items, wording, etc., based on prior information about the participants (Sauermann & Roach, 2013). For instance, one can obtain information about individual persons, such as their gender or whether they have children, and then adapt items accordingly (e.g., by adjusting them to the gender of the participant) or omit items that are irrelevant. Finally, one should consider whether participants have sufficient skills to participate. Participation would in most cases consist of opening a program (e.g., email, message, survey app), clicking on a link to the study platform, scrolling, typing text or numbers, and submitting responses. In some cases, training participants in the lab or via tutorials (e.g., in the form of videos) can be appropriate.

Population of situations

Contrary to cross-sectional or panel studies, in which the situation is either held constant or assumed not to be relevant to the characteristic assessed, the situation in which participants respond to their questionnaires matters a lot. Research has shown that momentary behaviors, affect, and cognitions vary across situations (Horstmann et al., 2020; Sherman et al., 2015), and that about half of the variance observed in ESM studies can potentially be explained by momentary characteristics of the person or the situation (Podsakoff et al., 2019). Of course, if no situation-to-situation variability existed, conducting an ESM study would be pointless. Depending on how the ESM study is set up, different situations will be targeted. Wheeler and Reis (1991) differentiated three sampling procedures: (1) interval-contingent (participants take part after a certain time has passed), (2) signal-contingent (participants take part whenever they are invited), and (3) event-contingent (participants take part whenever something happens) designs.


Nowadays, however, most ESM studies will use a signal to invite participants; the question is then what triggers the signal. If the signal is triggered based on a predefined schedule, it is an interval-contingent design (sometimes also referred to as a time-contingent design); if the signal is triggered by an event, one speaks of an event-contingent design. I will therefore only differentiate between these two designs.

Interval-contingent. When using interval-contingent sampling, participants are required to respond after certain intervals. These can be fixed or random intervals; in both cases, it will be unknown where a participant is when they are sampled. Interval-contingent sampling therefore leads to the assessment of participants in random situations. Targeting random situations during ESM is especially useful if the research question concerns general processes (e.g., how do situations predict simultaneous behavior?). The advantage of targeting random situations is that this flexible design can generate a lot of data in a short period of time. Every sampled data point can be considered valid, as it represents a genuine experience of the participant in their daily life. The disadvantage of a schedule that focuses on random situations is that participants can, to some extent, choose when to participate. This may bias the estimation of effects, as certain situations simply do not work for experience sampling. First, the very nature of the situation may inhibit the participant's interest or ability to take part. For example, the DIAMONDS taxonomy (Rauthmann et al., 2014) suggests that situations can be described on eight dimensions, among them Mating (assessed via the item "a potential romantic partner is present"). Ratings on this dimension are typically right-skewed: participants who take part are usually not experiencing a high-mating situation, and, vice versa, participants in high-mating situations rarely take part (Horstmann et al., 2020; Roemer et al., 2020). Reports from such situations are then not missing at random, as a quality of the situation (e.g., high perceived mating) is directly related to the fact that it was not recorded. Second, participants may be in the same situation for a very long period of time, which will also decrease the variance in their scores. The weeks before exams, for example, during which students study for much of the day, may restrict students' daily experiences. Here, a student may report the same value over and over, not because they are not complying, but because their daily experiences lack diversity. Finally, participants' daily lives may simply not allow them to complete many surveys in different situations. If one were to set up an ESM schedule that required participants to take part at 9 am, 12 pm (noon), 3 pm, 6 pm, and 9 pm, at least three of these five measurement occasions would normally fall within regular working hours. In that sense, the context "work" would be overrepresented, whereas the context "nonwork" would be underrepresented. It is a theoretical and empirical question whether this matters, but given that behavior varies within and across contexts (Geukes, Nestler, Hutteman, Küfner, et al., 2017; Oud, Voelkle, & Driver, 2017), it should be considered when interpreting results from ESM studies that have sampled random situations.

Event-contingent. Alternatively, in study designs targeting specific situations, participants are asked to respond whenever a predefined event occurs. For example, participants can be asked to take part whenever they have a social interaction (Geukes, Nestler, Hutteman, Dufner, et al., 2017). This design is called an event-based (or event-contingent) assessment. Although the event-based design has the advantage of assessing only those situations one is interested in, a problem is that participants need to be reminded frequently to participate, in case they forget which event they were supposed to respond to. Using smartphones, it is also possible to track participants passively: for example, a signal could be triggered whenever they arrive at work, whenever they have moved for more than X minutes, and so on. When sampling fixed situations, the generalizability of the results is potentially limited, while the results may be more valid for the targeted situations than under an approach that samples random situations.

Combination of interval-contingent and event-contingent designs. Finally, both designs can be combined. A very simple example is the combination of an experience sampling study and a daily diary study (see later for an example). The experience sampling part could target random situations in participants' daily lives. At the end of each day, participants could then receive an end-of-the-day survey, asking them about a specific interaction they experienced during the day.
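As a concrete illustration of an interval-contingent schedule with random signal times, the sketch below draws a fixed number of signals per day from a daily time window while enforcing a minimum gap between signals. The window boundaries, signal count, and minimum gap are arbitrary illustrative choices, not recommendations.

```python
import random
from datetime import datetime, timedelta

def daily_signal_times(day_start, day_end, n_signals=5, min_gap_minutes=60, seed=None):
    """Draw random signal times within [day_start, day_end], at least min_gap apart."""
    rng = random.Random(seed)
    total_minutes = int((day_end - day_start).total_seconds() // 60)
    # Rejection sampling: redraw until the minimum gap holds. For realistic
    # parameters (few signals, wide window) this terminates quickly.
    while True:
        minutes = sorted(rng.sample(range(total_minutes), n_signals))
        gaps = [b - a for a, b in zip(minutes, minutes[1:])]
        if all(g >= min_gap_minutes for g in gaps):
            return [day_start + timedelta(minutes=m) for m in minutes]

start = datetime(2021, 3, 1, 9, 0)   # 9 am
end = datetime(2021, 3, 1, 21, 0)    # 9 pm
for t in daily_signal_times(start, end, seed=42):
    print(t.strftime("%H:%M"))
```

An event-contingent design would replace the scheduler with a trigger (e.g., a detected location change), but the downstream invitation logic stays the same.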

Designing an experience sampling study

What can an ESM study look like, and which decisions have to be made at each stage? I have already outlined three broader questions (What is the construct being measured? What is the intended purpose of the measure? What is the targeted population of persons and situations?). Once these questions have been answered, one can proceed to the setup of the ESM study.

Basic setup

An ESM study usually contains the following six stages (although some are optional): 1. Registration of Participants, 2. Assessment of Time-Invariant (Person) Characteristics, 3. Repeated Assessment Hourly, 4. Repeated Assessment Daily, 5. Exit Assessment, 6. Feedback and Debriefing. Fig. 2 shows these stages of the ESM study setup.

FIG. 2 A general layout of an experience sampling study, including a daily diary part: (1) Registration of Participants: collect basic information required to conduct the study, such as e-mail, phone number, and times the participant can be reached. (2) Assessment of Time-Invariant Characteristics: assess time-invariant characteristics of participants, such as age, gender, socioeconomic status. (3) Repeated Assessment Hourly: repeated assessments on an hourly basis, such as affect, behavior, situations, social contacts, locations, geo-position, etc. (4) Repeated Assessment Daily: repeated assessment of day-invariant characteristics, such as satisfaction with the day, plans for the next day, general evaluation. (5) Exit Assessment: assessment of participant characteristics required to end the study, such as compliance with the study, but also bank details or specific consent. (6) Feedback and Debriefing: send participants feedback and a debriefing about the study, share personal data with participants, disclose the purpose of the study.

1. Registration of Participants. During registration, participants must first give their informed consent. This means that they have to be informed about the purpose of the study, the procedure, the data that are collected (especially if additional data are tracked, such as their geo-location), and how the data will be stored and distributed. Most of the information presented here would not differ between an ESM study and any other study, and it would go far beyond the scope of this chapter to list all elements of a well-crafted consent form (especially as they may differ from country to country). When participants have consented to participate, one should at first collect only their contact information. This ensures two things: First, this sensitive and identifying information can be stored separately from all other personal data. Second, participants who registered can be contacted later and thereby do not have to provide all information on the spot. Consequently, participants can register on their mobile phones at an otherwise inconvenient moment and fill out lengthy demographic surveys later at home. There are three ways of contacting participants during an ESM study: via email, via SMS, or by triggering a signal from an app. Each has its own advantages and disadvantages: emails are usually free but may not be checked frequently, SMS cost about 0.10 Euro per signal, and apps must be installed on the participant's device. After participants register, it is important to immediately check whether the contact information is valid, that is, whether the participant can be reached. At this stage, it may also be possible to collect the first name or nickname of the participant: Sauermann and Roach (2013) demonstrated that using participants' names and customizing reminders or invitations leads to higher participant engagement and lower attrition. If participants are compensated financially, one could also collect their bank details at this stage. Note that it can be useful to have participants register and to send them an email directly afterwards. This allows participants to register even if they do not have time to complete the initial survey right away (and reminds them later that they registered and wanted to participate).

2. Assessment of Time-Invariant Characteristics. The time-invariant part of the study usually assesses information that is assumed not to vary over the course of the ESM study. This can be demographic information, personality traits, preferences, interests, income, socioeconomic status, and so forth. Participants can also be screened (e.g., in terms of inclusion criteria) or placed into experimental groups according to random selection or some predefined criteria. It is also possible to include variables that will be used to provide participants with personalized feedback, such as their personality scores compared to other participants (see Point 6).


Additionally, it is possible to ask participants when they would like to start the next section of the study, or when, during the ESM phase, they will be unavailable. This allows for a personalized ESM schedule, which can conserve resources.

3. Repeated Assessment Hourly. The hourly repeated assessment is the main part of the ESM study (it may also run every 3 h, every 30 min, or completely at random several times per day; or, in an event-contingent design, be based on the occurrence of events). When designing the ESM part, several questions have to be addressed (see Table 3). Although the questions and possible options in Table 3 may not be exhaustive, they provide an overview of the issues faced when constructing an ESM study. Not all studies need to be that complex, and others may be even more sophisticated. When pondering the different options, one may want to weigh the additional time invested in programming or setting up each feature (e.g., how long does it take to implement an additional access window or reminder?) against its potential use (e.g., how often will this feature be relevant, and what happens if it is not present?). If the intended sample is highly compliant, reminders may not be relevant. On the other hand, if the sample is very large and managing individual participants becomes impossible, one should automate as many aspects of the study as possible.

Increasing participant retention. During experience sampling, the burden on the participant is comparatively high: filling out a questionnaire between 15 and 160 times is obviously not very engaging. There are several potential ways to increase participant retention. First, participants can be addressed by their first name or nickname in the invitations. Second, Horstmann and colleagues (2020) gave participants a random fact from psychological research (unrelated to the research question) after each assessment. Although it is empirically unclear whether this led to higher participation, some participants reported that they found this interesting and engaging. Furthermore, participants could be allowed to indicate when they need a break from the repeated assessment (like a snooze button, e.g., before going into the weekend). This may of course bias the situations sampled; yet, if the alternative is losing the participant, this can be preferable. Most of these strategies have not been tested empirically, but as long as the cost of their implementation is low compared to losing participants, they can be useful.

4. Repeated Assessment Daily. During experience sampling, it is also possible to provide an end-of-day survey (or any other survey that is distributed once a day, once a week, etc.), such as a daily diary portion. Such a survey could serve the purpose of obtaining more information on the participants' overall assessment of their day, whether they achieved what they had planned to achieve, and so forth. It can also be used to obtain information on goals for the next day. However, recent research has shown that assessments relying on retrospective reports do not necessarily converge with assessments conducted during the day (Lucas, Wallsworth, Anusic, & Donnellan, 2020). It cannot be said which of the two is more accurate, but in any case, the two provide somewhat different information. A clearly defined research question and purpose of the measure will help in choosing which of the two methods is more appropriate.

5. Exit Assessment. The experience sampling phase is usually completed after participants have responded to a certain number of surveys or have participated for a certain time (see Table 3).

TABLE 3 Questions and possible options for the setup of an experience sampling study.

Question | Possible options/issues to consider
When are the participants invited the first time? | • Directly after completion of the initial assessment • On a fixed date (e.g., a weekday) • Based on their own choice • At random the next day
How are participants invited? | • Email • SMS • In app
How often per day should participants be invited? | • Time based (e.g., every 3 h) • Event based (e.g., after each social interaction)
How long is the survey (incl. planned missingness)? | • Number of items, response format, planned missingness • Number of survey pages • Maximum time to complete the survey
Can participants skip items? | • All items mandatory • Skip a percentage/fixed number of items • Skip all items
Can participants skip individual surveys? | • No possibility to skip • Skip whenever required • Skip only at certain times, locations, etc.
How long after each invitation can participants take the survey (access windows)? | • Short interval (lower flexibility, higher control, potential frustration if access window is missed) • Long interval (more flexibility, lower control)
If they do not react, are they reminded (reminders)? | • Reminder, e.g., after 20 min (potentially reactive) • No reminder (missed invitations)
After how many total/missed reminders is a participant excluded? | • Automatic exclusion after a certain number of missed reminders • Automatic exclusion after a certain number of days in the study
Can participants get stuck? | • If the internet breaks down • If the participant is interrupted • If the participant misses an invitation
Can participants opt out? | • Possibility to opt out at each measurement occasion
Can participants get spammed (accidentally)? | • Accidental spamming (e.g., falsely programmed reminder) • Spamming if the participant is no longer active
When are participants finished? | • After a certain number of completed measures • After a certain number of invitations • After a certain number of days • When a certain event has occurred (date, transition, life event, number of informant reports completed, etc.)

Once participants have completed the daily and repeated assessments, a one-time exit assessment can be conducted. Here, participants could be asked about their overall experience during the study (how they liked it, whether they have feedback, etc.), and they could be asked to report on the same characteristics as at the initial assessment (for the estimation of test-retest reliability or stability). Finally, it is also possible to ask participants whether they would be willing to participate in further studies, which would technically allow complementing the rather complex data they just provided with pieces that might answer further research questions in the future.

6. Feedback and Debriefing. Once participants have completed the study and responded to all questions, they can be debriefed and thanked. The debriefing, as with any study, should provide them with information about the purpose of the study and the exact research question that was examined. Although this information must also be given at the beginning of a study, it is now possible to be more specific and to refer to items from the study. At this stage, it is also possible to provide participants with feedback. Personalized feedback can increase interest in the study, and participants are more likely to share the study with others if they receive feedback on their scores. However, not every participant wants to learn about their characteristics and behaviors, and participants should therefore be asked whether they would like to receive detailed feedback on different characteristics. When providing potentially harmful feedback, it is important to make sure that participants are able to interpret the feedback correctly. This can be ensured by explaining what the feedback will tell them and what it will not tell them. Subsequently, participants can be asked to answer questions regarding the interpretation of the feedback, and they only proceed to the feedback once these questions have been answered correctly. Additionally, it is important that participants have the option to ask questions or get further clarification. When providing feedback, the correct wording is also vitally important. Most psychological test procedures result in a score that is only interpretable by comparing the participant to a reference group. In these cases, it is more accurate to say that the participant described himself or herself as, for example, "more extraverted compared to the reference group." This implies that it is (a) a self-description and that it (b) depends on the reference group with which the participant is compared. Lastly, several (screening) instruments that have been developed for psychological research purposes may not be suitable for individual assessment, and one may debate whether these scores should be shared with participants at all.
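To make the reference-group logic concrete, the following sketch computes norm-referenced feedback as a percentile rank within the study sample. It is a hypothetical illustration: the function name, the reference scores, and the feedback wording are invented for the example.

```python
import numpy as np

def percentile_feedback(score, reference_scores, label):
    """Describe a self-reported score relative to a reference group."""
    pct = (np.asarray(reference_scores) < score).mean() * 100
    return (f"You described yourself as more {label} than about "
            f"{pct:.0f}% of the other participants in this study.")

# Hypothetical reference group: trait extraversion scores of other participants.
reference = np.array([2.1, 3.4, 2.8, 4.0, 3.1, 3.7, 2.5, 3.9, 3.0, 2.2])
print(percentile_feedback(3.5, reference, "extraverted"))
```

Note how the wording stays descriptive ("you described yourself as") rather than diagnostic, in line with the caveats above.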

Detailed setup

Fig. 3 gives a more detailed picture of an ESM setup, including all six stages of the study. Several eventualities are considered here. If the contact information of a participant does not work (e.g., a wrong email address or phone number at the initial assessment), the participant receives a different debriefing (at stage 6) than a participant who has completed all stages of the survey. Next, this setup checks what time of day it currently is, avoiding sending emails during nighttime (this may be especially relevant for text messages). If a participant does not react to a survey within a certain time, they receive a reminder; if they do react, no reminder is sent. Finally, this setup checks after each end-of-day survey whether the participant has completed enough hourly surveys; once they have completed all surveys, they proceed to the exit survey. This setup provides only a very rough outline of all the possible options. At the same time, it gives an insight into the problems that may be faced when setting up a basic ESM study.

FIG. 3 Detailed setup of an experience sampling study. Note: (1) and (2) form the initial assessment, (3) the ESM part of the study, (4) the daily diary part of the study, (5) the exit assessment at the end of the study, and (6) the debriefing of participants at the end of the study. A solid arrow indicates a "yes" response to the respective question, a dashed arrow a "no." A light gray box with a solid black frame represents an element that is visible to the participant; all other light gray boxes represent modules that are evaluated in the background.

Anticipating these eventualities on paper (i.e., simply by drawing such a figure) will save a lot of time during the setup and pilot phases of the study.
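The decision logic sketched in Fig. 3 can also be expressed directly in code, which makes it testable before any participant is enrolled. The following is a minimal sketch of the reminder-and-exclusion logic for a single invitation; the waiting times, thresholds, and function name are illustrative assumptions, not part of any particular platform.

```python
def handle_invitation(reacted_after, access_window=60, reminder_after=20,
                      max_missed=5, missed_so_far=0):
    """Resolve a single ESM invitation, following the logic sketched in Fig. 3.

    reacted_after: minutes until the participant reacted, or None if they never did.
    Returns (status, total_missed, reminder_sent).
    """
    # A reminder goes out if no reaction has occurred after `reminder_after` minutes.
    reminder_sent = reacted_after is None or reacted_after > reminder_after
    if reacted_after is not None and reacted_after <= access_window:
        return "completed", missed_so_far, reminder_sent
    # No reaction within the access window: count the miss, check the exclusion rule.
    missed = missed_so_far + 1
    status = "excluded" if missed >= max_missed else "missed"
    return status, missed, reminder_sent

print(handle_invitation(reacted_after=25))                     # completed, after a reminder
print(handle_invitation(reacted_after=None, missed_so_far=4))  # fifth miss: excluded
```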

Preregistration of experience sampling studies

Preregistration describes the process of specifying the planned data collection, data processing, and data analysis before data collection begins, or before substantial knowledge about a data set has been obtained (van't Veer & Giner-Sorolla, 2016). The goals of preregistration are twofold. First, preregistration clearly separates exploratory from confirmatory research. Second, it restricts researcher degrees of freedom (Simmons, Nelson, & Simonsohn, 2011). Researcher degrees of freedom come into play when there are many different ways to analyze the same data set (even for the same research question) and researchers can flexibly choose among them. As data from ESM studies are comparatively complex, the possibilities for analyzing (and preprocessing) the data are nearly endless (Bastiaansen et al., 2020). To illustrate this, consider the data set we would have obtained using the design described earlier, assuming we had collected information on the person, such as their Big Five personality traits, using the BFI-2 (Soto & John, 2017).


During ESM, we could then have collected corresponding self-reported behaviors: How open, conscientious, extraverted, agreeable, and emotionally stable did participants express themselves? If we had assessed each behavior with three items, if participants were allowed to skip individual items at each assessment occasion, and if we were to examine the relation between a self-reported personality trait and self-reported average behavior, we would already have to make a lot of decisions. For example, how are the scores for the personality traits computed (e.g., factor scores [under which measurement model?], sum scores, or average scores)? How many measurement occasions must participants have completed to be included in the main analysis? How are missing data treated (at the between- and the within-person level)? Is the correlation between traits and average behaviors estimated based on latent variables or on manifest variables? And, if it is based on manifest variables, is it necessary to correct the correlation for attenuation? If so, which estimate of reliability (e.g., alpha, omega, nested alpha) should be used to correct it?
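On the attenuation question, the classical correction divides the observed correlation by the square root of the product of the two reliability estimates; which reliability coefficient to plug in (alpha, omega, a multilevel variant) is precisely the kind of decision that belongs in the preregistration. In the usual notation,

$$ r_{\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}} $$

where $r_{xy}$ is the observed correlation between the two manifest scores and $r_{xx}$ and $r_{yy}$ are their reliabilities.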

While preregistering, it is usually not a problem to make a decision or find an answer to any one of these questions. The problem lies in finding all the relevant questions that need to be answered during the preregistration of the study. Given the diverse options, endless possibilities, and challenges attached to experience sampling, there are mainly three ways to detect most of the questions that should be answered. First, one can simply try to foresee all the different possibilities and do a mental walk-through of the data cleaning, preprocessing, and analysis. Although this may be possible, it will probably be very difficult, and it is easier for researchers who have already conducted ESM studies. Second, one can search for published research articles that have investigated a very similar research question. If the authors have shared their data and analysis script(s), it should be possible to adapt at least some solutions to one's own research question. Finally, it is possible to generate a data set that is similar to the one that will eventually be collected. This can be done by simulating the data (Mathieu, Aguinis, Culpepper, & Chen, 2012; Silvia et al., 2014); conducting a Monte Carlo simulation may furthermore be the only way to estimate power in ESM studies. Alternatively, and especially if a simulation is too complicated, it is possible to generate data simply by participating several times in one's own study and then conducting a mock analysis with those data to test the properties and implications of certain data-analytical decisions. ESM studies can be preregistered like any other study, for example, on the Open Science Framework (Foster & Deardorff, 2017). Kirtley, Lafit, Achterhof, Hiekkaranta, and Myin-Germeys (2019) provide more detailed information on the specifics of preregistering ESM studies.
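As a sketch of the simulation route, the snippet below generates a mock long-format ESM data set with person-level traits, nested occasions, and random occasion- and item-level missingness, which can then be fed through the planned analysis script before any real data exist. All dimensions, rates, and column names are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n_persons, n_occasions, n_items = 100, 40, 3

rows = []
for pid in range(n_persons):
    trait = rng.normal(3, 0.7)                      # person-level trait score
    for occ in range(n_occasions):
        if rng.random() < 0.20:                     # 20% of occasions missed entirely
            continue
        for item in range(n_items):
            if rng.random() < 0.10:                 # 10% item-level skipping
                continue
            state = np.clip(rng.normal(trait, 1.0), 1, 5)  # momentary state around trait
            rows.append({"person": pid, "occasion": occ,
                         "item": item, "state": state})

mock = pd.DataFrame(rows)

# Dry run of a preregistered decision: require at least 20 completed occasions.
completed = mock.groupby("person")["occasion"].nunique()
kept = mock[mock["person"].isin(completed[completed >= 20].index)]
print(len(mock), "rows;", kept["person"].nunique(), "persons retained")
```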

Pilot testing an experience sampling study

When testing or piloting a technically complex study such as an ESM study, it is helpful to imagine several different types of users. This not only generates pilot data that can be used to specify elements of the preregistration, but can also reveal flaws in the study design. Table 4 gives an overview of prototypical users who can potentially participate and should be mimicked when pilot testing an ESM study; a user may, of course, combine features of several of the listed characteristics. The technical setup of any ESM study should be able to deal with these users automatically and without human intervention. Furthermore, the preregistration of an ESM study should describe how the data sets of these hypothetical users will be processed and analyzed. Anticipating these different users can therefore increase the validity of the preregistration.

TABLE 4 Different hypothetical types of users in experience sampling studies.

| User | Characteristics | Flaws that can be discovered |
|---|---|---|
| Compliant | Takes part whenever invited, answers all questions | Minimum number of surveys, rushing through study, minimum interval between measurements |
| Semicompliant | Takes part in most surveys, but misses some | Premature exclusion of participant |
| Registrant | Registers, but never takes part in the study again | Exclusion of participant, spamming, sending reminders |
| Annoyed | Registers, but wants to leave the study right away | Options to opt out and be deleted at any stage |
| Data protection officer | Takes part, but only responds to the items that are not optional during registration | Robustness of study if information is missing, debriefing if participant submits wrong information, possibility to opt out from personalized feedback |
| Lazy | During experience sampling, never responds to the first invitation | Maximum number of reminders, maximum interval between measurement occasions, exclusion when missing too many reminders, message when taking part too late |
| Forgetful | Takes part in experience sampling, forgets to submit surveys, may lose a phone | User gets stuck, wants to continue even after missing too many reminders |
| Item skipper | Only responds to some items each time | Handling of missing data, differentiation of planned missing data and real missing data |
| Sloppy | Types in wrong contact information | Exclusion of participants, avoiding spamming, double registrations |
| Incompetent | Does not know how to respond to surveys, call (via phone), or send an email | Clarity of instructions, options to get help, maximum (realistic) response time |

Note. Although these users can be real, it is also possible to mimic users with these characteristics while piloting and preregistering an experience sampling study.

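As a hedged illustration of how such pilot users can be mimicked in software, the sketch below generates response patterns for a few of the prototypical users in Table 4 and checks them against a hypothetical exclusion rule. The user profiles and the 30%-compliance threshold are assumptions made for this example, not prescriptions from this chapter.

```python
# Generate response patterns for some prototypical users (cf. Table 4) and
# test a (hypothetical) preregistered exclusion rule against them.
import numpy as np

rng = np.random.default_rng(7)
n_invitations = 40

def responses(p_answer: float) -> np.ndarray:
    """Boolean vector: did the user answer each of the invitations?"""
    return rng.random(n_invitations) < p_answer

pilot_users = {
    "compliant": responses(1.00),
    "semicompliant": responses(0.80),
    "registrant": responses(0.00),  # registers, never takes part again
    "lazy": responses(0.45),        # misses many prompts, needs reminders
}

def exclude(answered: np.ndarray, min_rate: float = 0.30) -> bool:
    """Hypothetical rule: exclude participants below 30% compliance."""
    return answered.mean() < min_rate

for name, answered in pilot_users.items():
    print(f"{name:13s} compliance={answered.mean():.2f} excluded={exclude(answered)}")
```

Checking each simulated user against the planned exclusion and reminder rules reveals, for example, whether a registrant would be spammed with reminders or a semicompliant participant would be excluded prematurely.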

Platforms for experience sampling studies

When selecting a platform for an ESM study, several aspects of the study can be considered (see Table 5). For a start, Arslan, Walther, and Tata (2019) presented a survey framework, formr, that fulfills most of the requirements sketched here; their paper also lists further tools and platforms with different advantages and disadvantages. First, it is important to check whether one is actually capable of setting up the study with the skills and knowledge one already possesses or is willing to acquire within a reasonable amount of time. This decision can also be influenced by how long (potentially available) support takes to help with the study setup, or during an ongoing study when technical problems occur (e.g., when emails are no longer sent and participants start leaving the study). Second, and similarly important, a platform should be able to accommodate all requirements of the study. Some required features (e.g., the ability to integrate informant reports or to capture participants' geolocation) can be essential, and if a platform does not offer them, it simply cannot be used. Third, the costs of using a platform must be weighed against the available resources. Costs refer not only to the money charged for using the platform, but also to the opportunity costs (e.g., time, energy, resources) of a do-it-yourself solution that requires substantial effort to set up. Fourth, it may be relevant to use a platform that allows offline data collection (i.e., participants can submit responses without an internet connection), for example when participants live in an area with poor internet coverage. Fifth, one should consider the requirements participants must meet to use the platform. Some platforms may only be available for certain operating systems (Meers, Dejonckheere, Kalokerinos, Rummens, & Kuppens, 2020), and others may require an active setup or installation, or even a lab visit by the participant. Sixth, it is important to consider how participants are contacted during registration (e.g., during a lab visit or by registering through a website) and how they are then reminded to participate (e.g., via email, text message, or push notifications on their smartphones). Seventh, some studies require integration with other services (e.g., Amazon's MTurk for participant recruitment, mobile sensing, or capturing participants' geolocation). Eighth, the chosen platform should allow for some replicability: Is it possible to permanently store the study design, to share it with others, and for others to (freely) use it to replicate the study? Replicability will generally be higher if the platform does not require buying licenses. The final aspect, and one of considerable importance, is the protection of participants' data. Concerning data security, one should first examine which data are actually collected (and, if this cannot be determined, not consider the platform any further); second, it should be transparent with whom the data are shared (e.g., does the service provider keep a copy, and if so, of which data?). At the same time, data should be reasonably backed up by the system, without the user having to do so manually.

TABLE 5 Features to consider when selecting a platform for an experience sampling study.

| Feature | Description |
|---|---|
| Usability | Difficulty of setting up and maintaining an ongoing study; time and effort it takes to get help or support |
| Flexibility | Flexibility to integrate all aspects of the study (e.g., surveys, reminders, access windows to surveys, upload of pictures, customization, experience sampling, personalized feedback) |
| Costs | Costs of setting up the study; costs of using the platform |
| Online/offline | Possibility to collect data offline (i.e., without an internet connection) |
| Installation, requirements | Difficulty of installing the platform on participants' devices (if participants' own devices are not used); requirements for participants to take part (e.g., operating system, apps) |
| Contact with participant | Mode of contact with participants (i.e., registration, reminders during experience sampling); ability to send SMS or email, or to remind participants offline |
| Integration with other services | Ability to integrate with other platforms or modes of data collection (e.g., mobile sensing) |
| Replicability | Availability of code and platform to other researchers to replicate the study |
| Data protection | Protection of data; security of infrastructure and data transmission; full disclosure of which data are (secretly) collected; transparency; possibility to self-host the study (i.e., on one's own server); backup in case of data loss |

Summary

ESM studies can provide data for the description and explanation of psychological phenomena and processes, such as person-situation dynamics. Designing and conducting such studies is becoming ever easier, and the current literature allows researchers to build on others' experiences. ESM studies can be implemented at nearly every level of complexity, and the appropriate level of complexity strongly depends on the research question to be answered, the intended use of the scores obtained during sampling, and the targeted population of persons (and their anticipated competencies) and situations. When setting up an ESM study, many obstacles and challenges can be foreseen and solved (or avoided) by doing a mental walk-through of the study, the data preprocessing, and the data analysis.

References

Arslan, R. C., Reitz, A., Driebe, J., Gerlach, T. M., & Penke, L. (2020). Routinely randomise the display and order of items to estimate and adjust for biases in subjective reports. Retrieved from https://psyarxiv.com/va8bx.
Arslan, R. C., Schilling, K. M., Gerlach, T. M., & Penke, L. (2018). Using 26,000 diary entries to show ovulatory changes in sexual desire and behavior. Journal of Personality and Social Psychology. https://doi.org/10.1037/pspp0000208.
Arslan, R. C., Walther, M. P., & Tata, C. S. (2019). formr: A study framework allowing for automated feedback generation and complex longitudinal experience-sampling studies using R. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01236-y.


Augustine, A. A., & Larsen, R. J. (2012). Is a trait really the mean of states? Journal of Individual Differences, 33, 131–137. https://doi.org/10.1027/1614-0001/a000083.
Bastiaansen, J. A., Kunkels, Y. K., Blaauw, F. J., Boker, S. M., Ceulemans, E., Chen, M., … Bringmann, L. F. (2020). Time to get personal? The impact of researchers' choices on the selection of treatment targets using the experience sampling methodology. Journal of Psychosomatic Research, 110211. https://doi.org/10.1016/j.jpsychores.2020.110211.
Baumert, A., Schmitt, M., Perugini, M., Johnson, W., Blum, G., Borkenau, P., … Wrzus, C. (2017). Integrating personality structure, personality process, and personality development. European Journal of Personality, 31, 503–528. https://doi.org/10.1002/per.2115.
Bevans, G. E. (1913). How workingmen spend their spare time. Columbia University. Retrieved from https://archive.org/details/howworkingmenspe00bevarich.
Bleidorn, W. (2009). Linking personality states, current social roles and major life goals. European Journal of Personality, 23, 509–530. https://doi.org/10.1002/per.731.
Buschek, D., Völkel, S., Stachl, C., Mecke, L., Prange, S., & Pfeuffer, K. (2018). Experience sampling as information transmission. In Proceedings of the 2018 ACM international joint conference and 2018 international symposium on pervasive and ubiquitous computing and wearable computers—UbiComp '18 (pp. 606–611). New York, USA: ACM Press. https://doi.org/10.1145/3267305.3267543.
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2013). Mining large-scale smartphone data for personality studies. Personal and Ubiquitous Computing, 17, 433–450. https://doi.org/10.1007/s00779-011-0490-1.
Colom, R., Bensch, D., Horstmann, K. T., Wehner, C., & Ziegler, M. (2019). Special issue "The ability–personality integration." Journal of Intelligence, 7, 13. https://doi.org/10.3390/jintelligence7020013.
Duckworth, A. L., Quinn, P. D., Lynam, D. R., Loeber, R., & Stouthamer-Loeber, M. (2011). Role of test motivation in intelligence testing. Proceedings of the National Academy of Sciences, 108, 7716–7720. https://doi.org/10.1073/pnas.1018601108.
Eisele, G., Vachon, H., Lafit, G., Kuppens, P., Houben, M., Myin-Germeys, I., & Viechtbauer, W. (2020). The effects of sampling frequency and questionnaire length on perceived burden, compliance, and careless responding in experience sampling data in a student population. https://doi.org/10.31234/osf.io/zf4nm.
Epstein, S. (1980). The stability of behavior: II. Implications for psychological research. American Psychologist, 35, 790–806.
Finnigan, K. M., & Vazire, S. (2018). The incremental validity of average state self-reports over global self-reports of personality. Journal of Personality and Social Psychology, 115, 321–337. https://doi.org/10.1037/pspp0000136.
Fleeson, W. (2001). Toward a structure- and process-integrated view of personality: Traits as density distributions of states. Journal of Personality and Social Psychology, 80, 1011–1027. https://doi.org/10.1037/0022-3514.80.6.1011.


Fleeson, W., & Jayawickreme, E. (2015). Whole trait theory. Journal of Research in Personality, 56, 82–92. https://doi.org/10.1016/j.jrp.2014.10.009.
Flügel, J. C. (1925). A quantitative study of feeling and emotion in everyday life. British Journal of Psychology: General Section, 15, 318–355. https://doi.org/10.1111/j.2044-8295.1925.tb00187.x.
Foster, E. D., & Deardorff, A. (2017). Open science framework (OSF). Journal of the Medical Library Association, 105. https://doi.org/10.5195/JMLA.2017.88.
Geukes, K., Nestler, S., Hutteman, R., Dufner, M., Küfner, A. C. P., Egloff, B., … Back, M. D. (2017). Puffed-up but shaky selves: State self-esteem level and variability in narcissists. Journal of Personality and Social Psychology, 112, 769–786. https://doi.org/10.1037/pspp0000093.
Geukes, K., Nestler, S., Hutteman, R., Küfner, A. C. P., & Back, M. D. (2017). Trait personality and state variability: Predicting individual differences in within- and cross-context fluctuations in affect, self-evaluations, and behavior in everyday life. Journal of Research in Personality, 69, 124–138. https://doi.org/10.1016/j.jrp.2016.06.003.
Hampel, R. (1977). Adjektiv-Skalen zur Einschätzung der Stimmung (SES). Diagnostica, 23, 43–60.
Harari, G. M., Müller, S. R., Stachl, C., Wang, R., Wang, W., Bühner, M., … Gosling, S. D. (2019). Sensing sociability: Individual differences in young adults' conversation, calling, texting, and app use behaviors in daily life. Journal of Personality and Social Psychology. https://doi.org/10.1037/pspp0000245.
Haselton, M. G., & Gangestad, S. W. (2006). Conditional expression of women's desires and men's mate guarding across the ovulatory cycle. Hormones and Behavior, 49, 509–518. https://doi.org/10.1016/j.yhbeh.2005.10.006.
Hektner, J., Schmidt, J., & Csikszentmihalyi, M. (2007). Experience sampling method. Thousand Oaks, CA: SAGE Publications. https://doi.org/10.4135/9781412984201.
Hofmann, W., Vohs, K. D., & Baumeister, R. F. (2012). What people desire, feel conflicted about, and try to resist in everyday life. Psychological Science, 23, 582–588. https://doi.org/10.1177/0956797612437426.
Hofmann, W., Wisneski, D. C., Brandt, M. J., & Skitka, L. J. (2014). Morality in everyday life. Science, 345, 1340–1343. https://doi.org/10.1126/science.1251560.
Hofmans, J., De Clercq, B., Kuppens, P., Verbeke, L., & Widiger, T. A. (2019). Testing the structure and process of personality using ambulatory assessment data: An overview of within-person and person-specific techniques. Psychological Assessment, 31, 432–443. https://doi.org/10.1037/pas0000562.
Horstmann, K. T., & Rauthmann, J. F. (in preparation). How many states make a trait? A comprehensive meta-analysis of experience sampling studies.

Horstmann, K. T., Rauthmann, J. F., & Sherman, R. A. (2018). Measurement of situational influences. In V. Zeigler-Hill & T. K. Shackelford (Eds.), The SAGE handbook of personality and individual differences (pp. 465–484). SAGE Publications.
Horstmann, K. T., Rauthmann, J. F., Sherman, R. A., & Ziegler, M. (2020). Unveiling an exclusive link: Predicting behavior with personality, situation perception, and affect in a pre-registered experience sampling study. Journal of Personality and Social Psychology, in press.
Horstmann, K. T., & Ziegler, M. (2019). Situational perception and affect: Barking up the wrong tree? Personality and Individual Differences, 136, 132–139. https://doi.org/10.1016/j.paid.2018.01.020.
Horstmann, K. T., & Ziegler, M. (2020). Assessing personality states: What to consider when constructing personality state measures. European Journal of Personality. https://doi.org/10.1002/per.2266.
Horstmann, K. T., Ziegler, J., & Ziegler, M. (2018). Assessment of situational perceptions. In J. F. Rauthmann, R. Sherman, & D. C. Funder (Eds.), The Oxford handbook of psychological situations (Vol. 1). Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780190263348.013.21.
Human, L. J., Mignault, M.-C., Biesanz, J. C., & Rogers, K. H. (2019). Why are well-adjusted people seen more accurately? The role of personality-behavior congruence in naturalistic social settings. Journal of Personality and Social Psychology, 117, 465–482. https://doi.org/10.1037/pspp0000193.
Jayawickreme, E., Zachry, C. E., & Fleeson, W. (2019). Whole trait theory: An integrative approach to examining personality structure and process. Personality and Individual Differences, 136, 2–11. https://doi.org/10.1016/j.paid.2018.06.045.
Kirtley, O. J., Lafit, G., Achterhof, R., Hiekkaranta, A. P., & Myin-Germeys, I. (2019). Making the black box transparent: A pre-registration template for studies using experience sampling methods (ESM). https://doi.org/10.31234/osf.io/seyq7.
Lucas, R. E., Wallsworth, C., Anusic, I., & Donnellan, M. B. (2020). A direct comparison of the day reconstruction method (DRM) and the experience sampling method (ESM). Journal of Personality and Social Psychology. https://doi.org/10.1037/pspp0000289.
Mathieu, J. E., Aguinis, H., Culpepper, S. A., & Chen, G. (2012). "Understanding and estimating the power to detect cross-level interaction effects in multilevel modeling": Correction to Mathieu, Aguinis, Culpepper, and Chen (2012). Journal of Applied Psychology, 97, 981. https://doi.org/10.1037/a0029358.
Meers, K., Dejonckheere, E., Kalokerinos, E. K., Rummens, K., & Kuppens, P. (2020). mobileQ: A free user-friendly application for collecting experience sampling data. Behavior Research Methods, 52(4), 1510–1515. https://doi.org/10.3758/s13428-019-01330-1.


Meindl, P., Jayawickreme, E., Furr, R. M., & Fleeson, W. (2015). A foundation beam for studying morality from a personological point of view: Are individual differences in moral behaviors and thoughts consistent? Journal of Research in Personality, 59, 81–92. https://doi.org/10.1016/j.jrp.2015.09.005.
Nezlek, J. B. (2017). A practical guide to understanding reliability in studies of within-person variability. Journal of Research in Personality, 69, 149–155. https://doi.org/10.1016/j.jrp.2016.06.020.
Oud, J. H. L., Voelkle, M. C., & Driver, C. C. (2017). SEM based CARMA time series modeling for arbitrary N. Multivariate Behavioral Research, 17, 1–21. https://doi.org/10.1080/00273171.2017.1383224.
Podsakoff, N. P., Spoelma, T. M., Chawla, N., & Gabriel, A. S. (2019). What predicts within-person variance in applied psychology constructs? An empirical examination. Journal of Applied Psychology, 104, 727–754. https://doi.org/10.1037/apl0000374.
Quintus, M., Egloff, B., & Wrzus, C. (under review). Momentary processes predict long-term development in explicit and implicit representations of Big Five traits: An empirical test of the TESSERA framework.
Rammstedt, B., & Kemper, C. J. (2011). Measurement equivalence of the Big Five: Shedding further light on potential causes of the educational bias. Journal of Research in Personality, 45, 121–125. https://doi.org/10.1016/j.jrp.2010.11.006.
Rauthmann, J. F., Gallardo-Pujol, D., Guillaume, E. M., Todd, E., Nave, C. S., Sherman, R. A., … Funder, D. C. (2014). The situational eight DIAMONDS: A taxonomy of major dimensions of situation characteristics. Journal of Personality and Social Psychology, 107, 677–718. https://doi.org/10.1037/a0037250.
Rauthmann, J. F., Horstmann, K. T., & Sherman, R. A. (2019). Do self-reported traits and aggregated states capture the same thing? A nomological perspective on trait-state homomorphy. Social Psychological and Personality Science, 10, 596–611. https://doi.org/10.1177/1948550618774772.
Rauthmann, J. F., Horstmann, K. T., & Sherman, R. A. (2020). The psychological characteristics of situations: Towards an integrated taxonomy. In J. F. Rauthmann, R. A. Sherman, & D. C. Funder (Eds.), The Oxford handbook of psychological situations. Oxford: Oxford University Press. https://doi.org/10.1093/oxfordhb/9780190263348.013.19.
Rauthmann, J. F., Jones, A. B., & Sherman, R. A. (2016). Directionality of person-situation transactions: Are there spillovers among and between situation experiences and personality states? Personality and Social Psychology Bulletin, 42, 893–909. https://doi.org/10.1177/0146167216647360.


Rauthmann, J. F., & Sherman, R. A. (2016). Ultra-brief measures for the situational eight DIAMONDS domains. European Journal of Psychological Assessment, 32, 165–174. https://doi.org/10.1027/1015-5759/a000245.
Rauthmann, J. F., Sherman, R. A., & Funder, D. C. (2015). Principles of situation research: Towards a better understanding of psychological situations. European Journal of Personality, 29, 363–381. https://doi.org/10.1002/per.1994.
Roemer, L., Horstmann, K. T., & Ziegler, M. (2020). Sometimes hot, sometimes not: The relations between selected situational vocational interests and situation perception. European Journal of Personality. https://doi.org/10.1002/per.2287.
Rogers, K. H., & Biesanz, J. C. (2019). Reassessing the good judge of personality. Journal of Personality and Social Psychology, 117(1), 186–200. https://doi.org/10.1037/pspp0000197.
Sauermann, H., & Roach, M. (2013). Increasing web survey response rates in innovation research: An experimental study of static and dynamic contact design features. Research Policy, 42, 273–286. https://doi.org/10.1016/j.respol.2012.05.003.
Schoedel, R., Au, Q., Völkel, S. T., Lehmann, F., Becker, D., Bühner, M., … Stachl, C. (2018). Digital footprints of sensation seeking. Zeitschrift für Psychologie, 226, 232–245. https://doi.org/10.1027/2151-2604/a000342.
Schönbrodt, F. D., Zygar, C., Nestler, S., Pusch, S., & Hagemeyer, B. (submitted). Measuring motivational relationship processes in experience sampling: A reliability model for moments, days, and persons nested in couples. https://doi.org/10.31234/osf.io/6mq7t.
Scollon, C. N., & Kim-Prieto, C. (2003). Experience sampling: Promises and pitfalls, strengths and weaknesses. Journal of Happiness Studies, 4, 5–34. https://doi.org/10.1023/A:1023605205115.
Sherman, R. A., Rauthmann, J. F., Brown, N. A., Serfass, D. G., & Jones, A. B. (2015). The independent effects of personality and situations on real-time expressions of behavior and emotion. Journal of Personality and Social Psychology, 109, 872–888. https://doi.org/10.1037/pspp0000036.
Shiffman, S., Stone, A. A., & Hufford, M. R. (2008). Ecological momentary assessment. Annual Review of Clinical Psychology, 4, 1–32. https://doi.org/10.1146/annurev.clinpsy.3.022806.091415.
Silvia, P. J., Kwapil, T. R., Walsh, M. A., & Myin-Germeys, I. (2014). Planned missing-data designs in experience-sampling research: Monte Carlo simulations of efficient designs for assessing within-person constructs. Behavior Research Methods, 46, 41–54. https://doi.org/10.3758/s13428-013-0353-y.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. https://doi.org/10.1177/0956797611417632.


Soto, C. J., & John, O. P. (2017). The next Big Five Inventory (BFI-2): Developing and assessing a hierarchical model with 15 facets to enhance bandwidth, fidelity, and predictive power. Journal of Personality and Social Psychology, 113, 117–143. https://doi.org/10.1037/pspp0000096.
Stachl, C., Hilbert, S., Au, J.-Q., Buschek, D., De Luca, A., Bischl, B., … Bühner, M. (2017). Personality traits predict smartphone usage. European Journal of Personality, 31, 701–722. https://doi.org/10.1002/per.2113.
Statistisches Bundesamt (2018). Ausstattung privater Haushalte mit Informations- und Kommunikationstechnik - Deutschland.
Steger, D., Schroeders, U., & Wilhelm, O. (2019). On the dimensionality of crystallized intelligence: A smartphone-based assessment. Intelligence, 72, 76–85. https://doi.org/10.1016/j.intell.2018.12.002.
Stone, A. A., Wen, C. K. F., Schneider, S., & Junghaenel, D. U. (2020). Evaluating the effect of daily diary instructional phrases on respondents' recall time frames: Survey experiment. Journal of Medical Internet Research, 22, e16105. https://doi.org/10.2196/16105.
Sun, J., & Vazire, S. (2019). Do people know what they're like in the moment? Psychological Science, 30, 405–414. https://doi.org/10.1177/0956797618818476.
van Berkel, N., Goncalves, J., Loven, L., Ferreira, D., Hosio, S., & Kostakos, V. (2019). Effect of experience sampling schedules on response rate and recall accuracy of objective self-reports. International Journal of Human-Computer Studies, 125, 118–128. https://doi.org/10.1016/j.ijhcs.2018.12.002.
van't Veer, A. E., & Giner-Sorolla, R. (2016). Pre-registration in social psychology—A discussion and suggested template. Journal of Experimental Social Psychology, 67, 2–12. https://doi.org/10.1016/j.jesp.2016.03.004.

Voelkle, M. C., & Oud, J. H. L. (2013). Continuous time modelling with individually varying time intervals for oscillating and non-oscillating processes. British Journal of Mathematical and Statistical Psychology, 66, 103–126. https://doi.org/10.1111/j.2044-8317.2012.02043.x.
Vogelsmeier, L. V. D. E., Vermunt, J. K., van Roekel, E., & De Roover, K. (2019). Latent Markov factor analysis for exploring measurement model changes in time-intensive longitudinal studies. Structural Equation Modeling: A Multidisciplinary Journal, 26, 557–575. https://doi.org/10.1080/10705511.2018.1554445.
Wheeler, L., & Reis, H. T. (1991). Self-recording of everyday life events: Origins, types, and uses. Journal of Personality, 59, 339–354. https://doi.org/10.1111/j.1467-6494.1991.tb00252.x.
Wilt, J. A., & Revelle, W. (2015). Affect, behaviour, cognition and desire in the big five: An analysis of item content and structure. European Journal of Personality, 29, 478–497. https://doi.org/10.1002/per.2002.
Wrzus, C., & Mehl, M. R. (2015). Lab and/or field? Measuring personality processes and their social consequences. European Journal of Personality, 29, 250–271. https://doi.org/10.1002/per.1986.
Wrzus, C., & Roberts, B. W. (2017). Processes of personality development in adulthood: The TESSERA framework. Personality and Social Psychology Review, 21, 253–277. https://doi.org/10.1177/1088868316652279.
Ziegler, M. (2014). Stop and state your intentions! European Journal of Psychological Assessment, 30, 239–242. https://doi.org/10.1027/1015-5759/a000228.
Ziegler, M., Horstmann, K. T., & Ziegler, J. (2019). Personality in situations: Going beyond the OCEAN and introducing the Situation Five. Psychological Assessment, 31, 567–580. https://doi.org/10.1037/pas0000654.


CHAPTER 31

Modeling developmental processes

Jens B. Asendorpf
Department of Psychology, Humboldt University of Berlin, Berlin, Germany

OUTLINE

Individual, mean, and differential change
Mean level, positional and rank-order stability, and linear change
Methods for describing developmental trajectories
  Multilevel models
  Latent growth curve models
  Autoregressive models
  Comparison of multilevel, growth curve, and autoregressive models
  Person-centered approaches: Development of types and types of development
Models for explaining developmental trajectories
  Statistically controlled long-term prediction
  Intervention studies
  Natural experiments
  Crosslagged analyses
  Longitudinal mediation
Conclusion
References

Abstract

In this chapter, I discuss the main methods that are currently used to describe, predict, and explain long-term development, with a particular focus on between-person differences in individual trajectories of change. First, I distinguish between three concepts of change (individual, mean, and differential change) and associated concepts of stability (mean-level and positional/rank-order stability), and highlight the equivalence of between-person differences in change and change in between-person differences in the case of linear change. Subsequently, I contrast three variable-centered models for describing differential change with each other (multilevel, latent growth curve, and autoregressive models) and outline two person-centered approaches to the description of change (development of personality types and types of personality development). Finally, I discuss four approaches to explaining personality development (statistically controlled prediction, intervention studies, natural experiments, and crosslagged analysis, including longitudinal mediation) with an eye on differences between within-person and between-person mechanisms of change.


Keywords: Stability, Change, Development, Personality trait, Personality type, Multilevel analysis, Growth curve analysis, Autoregressive models, Crosslagged analysis, Longitudinal mediation

Individual, mean, and differential change

According to a famous quotation, "Every man is in certain respects (a) like all other men, (b) like some other man, (c) like no other man" (Kluckhohn & Murray, 1961, p. 53). General psychology is concerned with Point (a), considering between-person differences as measurement error, whereas most personality research is concerned with Point (b), focusing on between-person differences in recurrent behaviors or experiences that show high short-term temporal stability and significant cross-situational consistency (personality traits). The restriction to traits is not an unduly narrow perspective on personality if the definition is understood broadly, including abilities, motives, attitudes, self-concept, self-esteem, and narratives about the self, as long as their high short-term stability has been shown. This variable-centered approach to personality differences is sometimes complemented with a person-centered approach that studies the within-person organization of traits in terms of trait profiles and their grouping into personality types (Asendorpf, 2015). Lastly, single-case ("idiographic") studies are concerned with Point (c) (e.g., historiometry; see Simonton, 1990, 1998). If we apply this view to long-term developmental change, everyone develops in certain respects (a) like everyone else, (b) like some others, and (c) like no one else. Developmental psychology is concerned with all three perspectives on development, whereas personality development research focuses on changes in traits and types. In the case of Point (a), the focus is on changes in the mean level of traits or the

prototypical profile and size of types (mean-level change). In the case of Point (b), the focus is on between-person differences in changes in traits or in the membership in types (differential change). In the case of Point (c), the focus is on the change of a trait or a trait profile within a single individual (individual change). Mean-level change ignores differential change, differential change contrasts individual change with mean-level change, and individual change confounds mean-level and differential change (i.e., if you only know how an individual changes, you do not know to what extent this change is shown by everyone else, by some others, or by no one else). The difference between the three perspectives on personality development is illustrated in Fig. 1. In the case depicted in Fig. 1, no individual change was observed for a particular member of a birth cohort (the individual trajectory showed neither an increase nor a decrease), but because a mean-level change of the trait occurred in the birth cohort, evidenced by a mean increase in the trait, the individual showed differential change.

Mean level, positional and rank-order stability, and linear change

If we do not contrast one individual with the mean of the cohort but two individuals of the same cohort, we can easily see that differential changes are principally independent of mean changes (see Fig. 2). In the case depicted in Fig. 2, no mean change occurred, but two individuals showed (opposite) differential change. From another perspective, the position of the individuals on the trait dimension changed so much that their rank order also changed (i.e., Individual 2 ranked higher in the trait than Individual 1 at an early age, but lower at a later age). Graphically speaking, the rank order of two individuals changes if their developmental


FIG. 1 No individual change but strong mean-level and differential change (trait plotted against age).

FIG. 2 No mean-level change but strong differential change.

trajectories cross each other. The positional stability of individuals is often called rank-order stability, although this is somewhat misleading because perfect rank-order stability can go along with slight differential changes that do not change the rank order (i.e., the trajectories are not parallel but do not cross). Positional instability and differential change are two equivalent views about personality

development if the individual developmental trajectories can be described by linear functions of time of the form y = a + bx, where x is time or a transformation of time, a is the intercept (the value of the line at x = 0), and b is the slope (the increase in y for a 1-unit increase in x). In this case, between-person differences in change are equivalent to change of between-person differences because


$$(a_1 + b_1 x) - (a_2 + b_2 x) = (a_1 - a_2) + (b_1 - b_2)x \qquad (1)$$

where the left side describes between-person differences in change and the right side describes change of between-person differences (i.e., change of differences in intercepts and slopes). Therefore, in the case of linear individual trajectories, mean individual change between two time points t1 and t2 is identical with the change in the means at t1 and t2 (mean-level change). Because x can be any transformation of time (e.g., x = time², x = e^time), linear models of change and the equivalence of between-person differences in change to change of between-person differences are more general than they may seem. Linear models of change have an important advantage for personality research, which is often based on arbitrary response scales such as Likert-type scales, where the zero point and the unit of the scale have no clear psychological interpretation. In these cases, multiplying all scores of all individuals X by a constant c, or adding a constant c to all scores (linear transformations of the scale), changes the means M and the standard deviations SD of X, and of changes in X, in simple ways:

$$M(cX) = c\,M(X), \quad M(c + X) = c + M(X); \qquad SD(cX) = c\,SD(X), \quad SD(c + X) = SD(X) \qquad (2)$$

$$\begin{aligned} M(cX_1 - cX_2) &= c\,M(X_1 - X_2), \quad M[(c + X_1) - (c + X_2)] = M(X_1 - X_2);\\ SD(cX_1 - cX_2) &= c\,SD(X_1 - X_2), \quad SD[(c + X_1) - (c + X_2)] = SD(X_1 - X_2) \end{aligned} \qquad (3)$$

Because of these advantages, the present chapter discusses only linear models of change. In models of change that are described by nonlinear functions of time, such as continuous time models (see Voelkle, Oud, Davidov, & Schmidt, 2012), the effects of linear transformations of the measurement scale are far from being understood. Thus, these nonlinear models should be used with extreme caution by personality researchers.
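To make Eq. (1) concrete, consider two hypothetical linear trajectories (the numbers are chosen purely for illustration):

```latex
% Illustrative values only: individual 1 follows y_1 = 2 + 0.5x,
% individual 2 follows y_2 = 1 + 1.0x.
\begin{align*}
y_1(x) - y_2(x) &= (2 + 0.5x) - (1 + 1.0x)\\
                &= (2 - 1) + (0.5 - 1.0)x = 1 - 0.5x.
\end{align*}
% The between-person difference is itself linear in x; it equals zero at
% x = 2, where the two trajectories cross and the rank order of the two
% individuals reverses (cf. Fig. 2).
```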

Methods for describing developmental trajectories

In this section, I outline methods that describe individual, differential, and mean trajectories over a developmental time scale such as years or many months (developmental trajectories). It is important to note at the outset that these trajectories (a) describe long-term change of short-term stable traits and (b) assume true differential change ("true" meaning after correction for unreliability of the measurements). Assumption (b) sets models of developmental trajectories apart from models of short-term change that include time-invariant trait factors (see Chapter 36). Assumption (b) is not a strong assumption because true differential trait change, or less than perfect true positional stability, is regularly observed in studies of personality development if the retest interval is long enough. Meta-analyses of the rank-order stability of heterogeneous traits (Ferguson, 2010; Roberts & DelVecchio, 2000) and nationally representative studies of the rank-order stability of the Big Five traits (Specht, Egloff, & Schmukle, 2011; Wortman, Lucas, & Donnellan, 2012) show differential change over periods of 4 to 7 years even after correction for attenuation by measurement error. For example, the disattenuated four-year rank-order stabilities in a nationally representative German study varied between .64 (conscientiousness) and .74 (extraversion), and showed in most cases an inverse U-shaped age dependency over adulthood, such that the stabilities were highest at midlife and lowest in young adults and the elderly (Specht et al., 2011). Even for extraversion, the most stable trait, the disattenuated four-year stability peaked below .80. These findings, among others, show that differential change is ubiquitous in long-term studies of personality development. Four main approaches to describing developmental trajectories can be distinguished; they have their own pros and cons and result, under certain assumptions, in equivalent models of


change: multilevel, latent growth curve, autoregressive, and person-centered methods. All approaches are based on longitudinal data because cross-sectional data cannot be used for describing individual or differential change. Sometimes cross-sectional age comparisons are used for describing mean changes, but results confound developmental changes of individuals with cohort effects. For example, if one compares in 2018 the scores of 18-, 28-, 38-, 48-, and 58-year olds in narcissism and finds decreasing narcissism with increasing age, this can be due to decreasing narcissism with increasing developmental age or to increasing narcissism over historical time (adults born in 2000 are more narcissistic at any age than adults born in 1960; see Twenge, Konrath, Foster, Campbell, & Bushman, 2008).

Multilevel models

Longitudinal multilevel models are discussed in every textbook on multilevel analysis (e.g., Hox, 2010; Raudenbush & Bryk, 2002). Multilevel models were developed to describe nested data structures where lower-order units are nested in higher-order units (e.g., students in classrooms). In longitudinal multilevel models, time points at Level 1 are nested in individuals at Level 2 (two-level models); in addition, individuals at Level 2 may be nested in groups at Level 3 (three-level models; e.g., a longitudinal study of students nested in classrooms). Consider first a two-level longitudinal multilevel model. It is described by three regression equations. The Level 1 equation describes the change of individuals i in a dependent variable Y across time points t:

$$y_{ti} = \pi_{0i} + \pi_{1i} t_i + e_{ti} \qquad \text{(Level 1)}$$

where yti is the value of individual i on the observed variable Y at time ti, π0i and π1i represent the intercept and linear slope for individual i, respectively, and eti is the time- and individual-specific measurement error. Thus, every individual has their own equation with their own intercept and slope, and the time points ti can vary across individuals (as they are not necessarily the same for every individual). The possibility of between-individual variation in the timing of the assessments distinguishes multilevel models from repeated-measures ANOVAs, where every individual is observed at fixed time points t. In the standard model, it is assumed that the errors eti are independent across time and show the same variance across time. The individual intercepts and slopes obtained at Level 1 can be predicted at Level 2 with two equations that can include time-invariant predictors (constant individual characteristics such as gender, ethnicity, or age at the beginning of the study). In the case of one predictor Z, the two equations read

$$\pi_{0i} = \beta_{00} + \beta_{01} z_i + r_{0i}, \qquad \pi_{1i} = \beta_{10} + \beta_{11} z_i + r_{1i} \qquad \text{(Level 2)}$$

where β00 and β01 represent the intercept and slope of the regression that predicts the individual intercepts from the individual values on the predictor Z, and β10 and β11 are the intercept and slope of the regression that predicts the individual linear slopes from the individual values on the predictor Z. More predictors can be entered into the Level 2 equations accordingly, resulting in additional Level 2 slopes. Effects (means, standard errors, etc.) based on the β coefficients are called fixed effects of the model because they describe properties of the sample, not of individuals. In contrast, effects based on the e and r coefficients are called the random effects of the model. β00 is the overall intercept and β10 is the overall linear slope in the sample. Because an intercept is the value of the dependent variable at the zero point of the predictor, it represents the mean if the predictor is centered at the mean of its raw scores. Therefore, the meaning of the Level 2 intercept depends on how time and the predictors at Level 2 are centered. It is


useful to center all variables at levels higher than 1 at the mean of the raw scores, such that the intercepts of higher levels can be interpreted as means at time 0. An exception to this centering recommendation are models where the Level 2 predictors are dummy variables for distinct groups (e.g., 1 = member of group, 0 = not member of group). In this case, the dummy variables should not be centered, and the intercept represents the group not described by the dummy variables. For example, in a study of differential change based on sexual orientation, bisexual and homosexual can each be coded with a dummy variable; consequently, the overall intercept is the outcome at time zero for heterosexuals (see Hox, 2010, Appendix C, for alternative methods of coding categorical data). Various options for centering time are discussed later. Longitudinal two-level models can easily be extended to longitudinal three-level models if the individuals are nested in groups (e.g., classrooms, work teams, neighborhoods). The intercepts and slopes at Level 2 (the βs) are predicted at Level 3 by variables on which the Level 3 units may vary (e.g., characteristics of classrooms such as teaching style or ethnic composition).

Sample sizes

It is important to note that all statistics for fixed effects depend on the number of units at the highest level; as a rule of thumb, this number should be 30 or higher, and the total number of assessments should be at least 900. For example, in a longitudinal study of students nested in classrooms, at least 30 classrooms are required. The number of units at lower levels is less important, such that even a two-level study with only two waves is possible if the total number of individuals is at least 450, because in this case the total number of assessments is 900 if no values are missing (see Hox, 2010, Chapter 12, for a more detailed discussion).
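Such two-level growth models can be fitted with standard mixed-model software. Purely as a hedged illustration, the following sketch fits the model described above to simulated data with Python's statsmodels; all variable names (pid, time, z, y) and parameter values are assumptions made for this example, not part of the chapter.

```python
# Two-level growth model: random intercepts and slopes per person,
# one time-invariant Level 2 predictor z. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n, waves = 200, 5

pid = np.repeat(np.arange(n), waves)
time = np.tile(np.arange(waves), n) - 2.0   # centered at the 3rd of 5 assessments
z = rng.normal(size=n)                      # time-invariant Level 2 predictor

# Person-specific intercepts and slopes (random effects r0i, r1i),
# partly predicted by z (fixed effects beta01, beta11).
b0 = 0.00 + 0.30 * z + rng.normal(scale=0.5, size=n)  # pi_0i
b1 = 0.20 + 0.10 * z + rng.normal(scale=0.1, size=n)  # pi_1i
y = b0[pid] + b1[pid] * time + rng.normal(scale=0.4, size=n * waves)

data = pd.DataFrame({"pid": pid, "time": time, "z": z[pid], "y": y})

# Fixed effects beta00..beta11 appear as Intercept, time, z, and time:z;
# re_formula="~time" adds random intercepts and slopes per person.
model = smf.mixedlm("y ~ time * z", data, groups=data["pid"], re_formula="~time")
print(model.fit().summary())
```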

Assumptions

Longitudinal multilevel models make two implicit assumptions. First, they assume measurement equivalence across time, that is, the same assessment instrument (or the different instruments used at different time points) has the same validity at each time point. Whereas measurement equivalence can be assumed in short-term studies, it can be of concern in studies covering a large age interval, even if the same instrument is used at all assessments (see Bollen & Curran, 2006, for a discussion of measurement equivalence across time). Second, standard applications of multilevel analysis assume uncorrelated measurement errors across time. This assumption is often violated if the assessments are closely spaced in time, but rarely in long-term studies with retest intervals of months if not years; therefore, this second assumption can readily be made in studies of long-term development. The assumption that the trajectories are linear in time is not as restrictive as it may seem because time can be transformed (e.g., squared time centered at zero tests quadratically decreasing growth). Further, different Level 1 predictors based on time can be used, such as π1i ti + π2i ti²; in this case, an individual i may show only a linear increase (because π2i = 0), whereas another individual j may show only quadratically decreasing growth (because π1j = 0).

Centering time

An important decision in multilevel longitudinal studies concerns how time is centered. If the longitudinal observation starts or ends at a psychologically meaningful time point (e.g., birth, starting to attend a particular class, separation from the partner, death), it is useful to center time at the first or last assessment (thus, the first or last assessment is coded as 0). If the observation starts at a more arbitrary point in time (e.g., a study of changes over adolescence between ages 12 and 18, where participants start puberty at different points in time), it is more useful to center time at


the midpoint of the observed interval (thus, a study with five yearly assessments is centered at the third assessment). This choice has the additional advantage that the intercepts are less correlated with the slopes. Similar, but not identical, is grand-mean centering, where time is centered at the mean assessment point of all individuals. Due to missing data or different spacing of the assessments, the mean assessment point can differ from the midpoint of the observation, such that its interpretation is less obvious. Not useful in most cases is centering time within each individual (person-mean centering, often called "group-mean centering," where, in the present case, "group" refers to individuals), because the individual Level 1 intercepts can then refer to different time points, such that these differences are confounded with other interindividual differences. Person-mean centering is only useful if the zero point is psychologically meaningful for each individual and varies between individuals (e.g., if data before and after the individually determined onset of puberty are analyzed).

Selective attrition

A frequent concern in longitudinal studies is that results are biased by selective dropout of participants (selective attrition). For example, if low-achieving participants drop out of school or refuse further participation in the study more frequently than high-achieving participants (which is often the case in longitudinal studies), the variance of achievement at later assessments is restricted, which, in ANOVAs or ordinary regressions, leads to an underestimation of effects on later achievement. One approach to avoiding such biases is estimating the missing assessments with (multiple) imputation (see Asendorpf, van de Schoot, Denissen, & Hutteman, 2014). If, however, the data are analyzed with longitudinal multilevel models, imputation is not necessary because the estimation procedure corrects for selective dropout as far as it is related to the first assessment (see Hox, 2010; Little, 2013).


Other missing data

Although missing data at the lowest level do not present a major problem in multilevel analysis, they do present a problem at higher levels because they reduce the number of included higher-level units and drop all lower-level assessments of these units. Thus, if a time-invariant Level 2 predictor has a missing score for one particular individual, all data of that individual are excluded, and the bias due to such missing values is not corrected without further adjustments. Therefore, it is useful to impute missing values at higher levels, and the best method is multiple imputation using all available variables, even those not included in the model (see Asendorpf et al., 2014, and Graham, 2009, for nontechnical overviews and practical recommendations). Depending on the multilevel software, multiple imputation is done before the multilevel analysis is run, or it is an option in the multilevel analysis procedure.
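As a hedged sketch of the multiple-imputation route, the snippet below draws several stochastic imputations of a person-level data matrix with scikit-learn; the data and the choice of m = 5 imputations are assumptions for this example. In a real analysis, the multilevel model would be fitted to each imputed data set and the estimates pooled.

```python
# Multiple stochastic imputations of a person-level (Level 2) data matrix.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))            # person-level variables, incl. a predictor
X[rng.random(X.shape) < 0.05] = np.nan   # ~5% missing values

# sample_posterior=True makes each imputation a random draw, so different
# random_state values yield m distinct imputed data sets.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)
]
```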

Latent growth curve models

These models are very similar to longitudinal two-level models (see, e.g., Bollen & Curran, 2006; Little, 2013). They are structural equation models (SEMs) with a fixed time schedule for all participants. The individual intercepts and the individual slopes are treated as latent factors of the observed manifest assessments of the dependent variable Y across the time points t:

$$y_{ti} = \lambda_{0i}\,\mathrm{intercept}_i + \lambda_{1i}\,\mathrm{slope}_i + e_{ti} \qquad \text{(latent growth curve model)}$$

where yti is the value of individual i on the observed variable Y at time t, λ0i and λ1i represent the factor loadings on the intercept and linear slope factors, and eti is the time- and individual-specific measurement error. The equation describes the measurement model for the latent intercept and slope factors (see Fig. 3). All loadings on the intercept factor are constrained to be equal to 1, such that this factor is


FIG. 3 A five-wave latent growth curve model with one time-invariant moderator M.

equivalent to the coefficient π0i in the Level 1 equation of a multilevel model. The loadings on the slope factor are constrained to be equal to 0, 1, 2, 3, … (omitting a path in a SEM is equivalent to including the path fixed to zero). Thus, the factor loadings on the slope factor play the role of the time variable ti in the Level 1 equation of a multilevel model, and the slope factor plays the role of the slope coefficient π1i in the Level 1 equation (see Hox, 2010, for a detailed description of this equivalence). The intercept and slope can be predicted by one or more time-invariant variables (thus, constant moderators). In the case of one time-invariant variable M, the equations for the two predictions are completely equivalent to the Level 2 equations of the two-level model that describes the same data. The only difference between the standard multilevel and the standard latent growth curve model is that the error variances in the multilevel model are assumed to be equal; if they are constrained to be equal in the latent growth curve

model, the multilevel and the latent growth curve model yield identical results. Later, I contrast the advantages and disadvantages of multilevel and latent growth curve models. One advantage of the latent growth curve model is that measurement error can be controlled with multiple indicators of latent variables. Multiple item parcels of a scale control for the internal inconsistency of the scale, multiple judges (e.g., mother and father ratings of children) control for judge-specific biases, and repeated assessments of the same variable within a few weeks control for fluctuations due to the current situation or affective state of the judges or the judged participants. In long-term studies covering many years, factorial invariance should be tested by comparing a constrained model with equal loadings of the indicators on a latent variable across time with an unconstrained model where the loadings can vary across time. If the constrained model does not fit worse than the unconstrained model, factorial invariance can be assumed. In any case,


indicator-specific stabilities should be modeled, either with correlated errors across waves or with indicator-specific measurement factors (see Geiser & Lockhart, 2012, for a comparison of different methods of dealing with measurement error). If multiple indicators are not available (e.g., short scales with relatively low internal consistency), known short-term retest unreliability from representative subsamples of the study, or from other studies with similar samples, can be used to control measurement error. One latent variable is assigned to each manifest variable, and the error variance is fixed to (1 − R)s², where R is the reliability of the scale and s² is the variance of the manifest variable (Hayduk, 1987). This trick avoids multiple assessments of the same variables in each wave for all participants.
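As a minimal numeric sketch of this single-indicator trick (the reliability and variance below are assumed example values, not taken from the chapter):

```python
# Error variance implied by a known (retest) reliability (Hayduk, 1987):
# the resulting value is entered as a fixed constraint in the SEM software.
def fixed_error_variance(reliability: float, observed_variance: float) -> float:
    """Fix the manifest variable's error variance to (1 - R) * s^2."""
    return (1.0 - reliability) * observed_variance

print(fixed_error_variance(reliability=0.78, observed_variance=1.44))  # 0.3168
```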

Autoregressive models

Whereas multilevel and growth curve models can be used to study person-specific trajectories across the full age range covered by the study, autoregressive models are based on the alternative perspective of "piecewise" change between two subsequent waves (see Eq. (1) in the earlier discussion of differential change). Studies of the positional stability of traits are based on such models. Because correlational stability studies analyze correlations between waves, they ignore changes in the variance of the assessments. Autoregressive models are based on the more general case of linear regressions between waves (called autoregressions because a variable is regressed on itself at an earlier time). Autoregressive models take differences in variance into account; however, the standardized autoregression coefficients are identical with the stability correlations obtained from correlational studies (see Little, 2013; Singer & Willett, 2003). Similar to growth curve models, autoregressive models are SEMs based on manifest variables or on latent variables, and in the latter case the same considerations


of controlling measurement error apply as for growth curve models. Standard autoregressive models do not assume stationarity (i.e., equal variances and autoregressions across time), but stationarity can be tested by comparing a stationary model with the unrestricted standard model. If the stationary model does not fit worse, stationarity can be assumed, and the variances and autoregressions can be constrained to be equal across time. In this case, a more parsimonious model with more robust autoregressions results, which should be preferred as the final model. If stationarity is not found in the case of more than two retest intervals (thus, more than three waves), it is useful to search for stationary pieces of the study in order to get a more parsimonious and robust description. One advantage of autoregressive models is that one can easily model stabilizing and destabilizing influences with lag2 autoregressions. Whereas the ordinary lag1 autoregressions regress between-person differences at time tn+1 on these differences at time tn, lag2 autoregressions in addition regress the between-person differences at time tn+2 on these differences at time tn. Positive lag2 autoregressions indicate stabilizing influences; negative lag2 autoregressions would indicate destabilizing influences but are rarely observed. In lag2 autoregressive models, it makes no sense to include the first lag1 regression in a test of stationarity because it is not controlled for a lag2 regression, whereas all later lag1 regressions are controlled for lag2 regressions. Instead, stationarity is shown if the lag2 regressions and all lag1 regressions except the first can be assumed constant across time. Significant positive lag2 autoregressions suggest that the assessments of the traits are confounded with a destabilizing factor. For example, assessments of traits can be confounded with the mood on the day of measurement; on good days one may tend to overestimate one's standing on socially desirable traits, on bad days


one may tend to underestimate it. In this case, the lag1 autoregressions underestimate the true stability of the traits because they are affected by differences in the states at the two assessments (more precisely, by differential change in the states). The lag2 autoregressions correct for this effect.
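To make the piecewise logic concrete, the following sketch regresses wave 3 on wave 2 (lag1 only) and then on waves 2 and 1 (lag1 plus lag2) for simulated, state-confounded trait assessments; all values are illustrative assumptions, and a full analysis would use SEM rather than plain OLS.

```python
# Piecewise lag1/lag2 autoregressions on three waves of a trait, using OLS
# on standardized scores so the weights are comparable to stability correlations.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 500

# Stable trait component plus occasion-specific state noise (e.g., mood).
trait = rng.normal(size=n)
y1, y2, y3 = (0.8 * trait + 0.6 * rng.normal(size=n) for _ in range(3))

def z(v):
    """Standardize a wave's scores."""
    return (v - v.mean()) / v.std()

y1, y2, y3 = z(y1), z(y2), z(y3)

lag1 = sm.OLS(y3, sm.add_constant(y2)).fit()            # lag1 only
X = sm.add_constant(np.column_stack([y2, y1]))          # lag1 + lag2
lag2 = sm.OLS(y3, X).fit()

# With state-confounded assessments, the y1 (lag2) coefficient is positive,
# indicating that the lag1 coefficient alone understates trait stability.
print(lag1.params.round(2), lag2.params.round(2))
```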

Comparison of multilevel, growth curve, and autoregressive models

The strengths and weaknesses of the three modeling approaches are summarized in Table 1. The main advantages of multilevel models are that modeling more than two levels is easy and straightforward, that no fixed time schedule for all individuals is required, and that nonlinear growth curves can easily be modeled using transformations of time. The main advantage of growth curve and autoregressive models is that measurement error can be easily controlled and additional latent variables can be easily added. Long-term trends over the full course of the study are best modeled with multilevel or latent growth curve models, whereas nonstationary changes from wave to wave are best modeled with autoregressive models (see, e.g., Hox, 2010, Chapter 16).

TABLE 1 Advantages of multilevel (ML), latent growth curve (LGC), and autoregressive (AR) models for describing developmental trajectories.

| Advantage | ML | LGC | AR |
|---|---|---|---|
| Nesting of individuals in higher-order units | + | − | − |
| No fixed time schedule | + | − | − |
| Nonlinear growth curves | + | − | − |
| Controlling measurement error | − | + | + |
| Adding latent variables | − | + | + |
| Long-term trends | + | + | − |
| Nonstationarity | − | − | + |

Note. + = advantage given or issue considered; − = no advantage or issue not considered.

Person-centered approaches: Development of types and types of development

All models presented earlier are variable-centered because they focus on individual trajectories in only one variable. Person-centered approaches focus on more complex descriptions of individuals, mostly in terms of profiles of scores on many variables (see Asendorpf, 2015, for an overview). The profiles can be based on scores on many variables, each assessed with a personality scale (e.g., a Big Five questionnaire), or "idiographically" by a Q-sort (Block, 1961, 2008), where each individual is described by a knowledgeable informant who sorts attributes, such as trait descriptions, according to how well they fit the target individual's personality. The resulting Q-sort describes the relative salience of the attributes for that individual and thus provides a person-centered personality profile. The profiles can be standardized by instructing the Q-sort judges to assign for each individual the same number of attributes to each category of salience (i.e., an equal within-person distribution).

TABLE 1 Advantages of multilevel (ML), latent growth curve (LGC), and autoregressive (AR) models for describing developmental trajectories.

Advantage                                      ML   LGC   AR
Nesting of individuals in higher-order units   +    −     −
No fixed time schedule                         +    −     −
Nonlinear growth curves                        +    −     −
Controlling measurement error                  −    +     +
Adding latent variables                        −    +     +
Long-term trends                               +    +     −
Nonstationarity                                −    −     +

Note. + = advantage given or issue considered; − = no advantage or issue not considered.


For example, in the California Child Q-Set (CCQ; Block & Block, 1980), 100 brief descriptions of traits on which children may vary are sorted into categories of increasing salience such that each category contains the same number of traits. Such an equal, forced distribution maximizes the within-individual variance, which is desirable from a person-centered perspective because it leads to maximally differentiated personality descriptions. A similar Q-sort is available for adults (California Adult Q-Set [CAQ]; Block, 1961, 2008).

Based on such profiles, replicable personality types (i.e., groups of persons with similar profiles) have been derived with cluster analyses (e.g., Asendorpf, Borkenau, Ostendorf, & van Aken, 2001; Asendorpf & van Aken, 1999) or latent class analyses (Meeus, Van de Schoot, Klimstra, & Branje, 2011; Specht, Luhmann, & Geiser, 2014). Each type can be described by a prototypical profile (the mean profile of the members of the type). Depending on the method of deriving the types and the criterion for determining the optimal number of types, different types can be derived from the same data. In most cases, three types have been found: resilients (R), undercontrollers (U), and overcontrollers (O) (the RUO types; see Specht et al., 2014, for an overview). Nevertheless, even in large representative samples where the types are derived from Big Five profiles with the same statistical method, the types can be different if the scales on which the profiles are based show different intercorrelations (Specht et al., 2014). Types and intercorrelations can be considered alternative methods of describing the multivariate distribution of the variables on which the profiles are based; differences between typologies can therefore be expected to be accompanied by differences in the intercorrelations of the variables, and vice versa.

The RUO types can be used in developmental studies for predicting either long-term outcomes of types (see, e.g., Asendorpf & Denissen, 2006; Chapman & Goldberg, 2011; Denissen, Asendorpf, & van Aken, 2008) or developmental changes in the membership of individuals in the types (e.g., the shift from an undercontroller to a resilient person). In the latter case, it is important to note that such changes confound individual changes in type membership with changes in the prototypical profiles if the types are derived separately for different time points (see Asendorpf, 2015). Therefore, the stationarity of the prototypical profiles (i.e., invariance across time) should be tested; if stationarity can be assumed, the prototypical profile should be fixed to be equal across time (see Specht et al., 2014), or the types can be defined by a prototypical profile that is defined a priori by extant research (see Meeus et al., 2011). For stationary profiles, type membership shows a high stability already over adolescence (Meeus et al., 2011) that increases into adulthood (Specht et al., 2014). Changes between the types can be described in terms of latent transition probabilities (e.g., the probability of switching from an undercontroller to a resilient person; see Meeus et al., 2011; Specht et al., 2014).

Persons can be grouped into personality types based on similar personality profiles but also based on similar individual developmental trajectories. For example, one may group them into increasers, decreasers, continuously high scorers, continuously low scorers, etc. Attempts to empirically distinguish developmental types can be traced back to Block (1971), who distinguished different developmental trajectories based on Q-sorts, and Magnusson (1988), who clustered adolescents according to their developmental trajectories. More recently, growth mixture modeling (GMM) has increasingly been used to identify latent classes of individual trajectories (see Jung & Wickrama, 2008, for the procedure, and Schaeffer, Petras, Ialongo, Poduska, & Kellam, 2003, for an example); a simplified sketch of the underlying idea follows below.
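As a rough illustration of grouping persons by trajectory shape, the following two-step sketch first estimates each person's intercept and slope by OLS and then clusters these growth parameters. This is only a crude stand-in for GMM, which estimates classes and trajectories simultaneously; all data and parameter values are simulated and illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, waves = 300, 4
time = np.arange(waves)

# Simulate three trajectory types: stable high, stable low, and increasers.
types = rng.integers(0, 3, n)
intercept = np.array([1.0, -1.0, -0.5])[types]
slope = np.array([0.0, 0.0, 0.5])[types]
scores = intercept[:, None] + slope[:, None] * time + rng.normal(0.0, 0.3, (n, waves))

# Step 1: person-specific growth parameters via OLS (np.polyfit: slope first).
params = np.array([np.polyfit(time, y, deg=1) for y in scores])

# Step 2: cluster the (slope, intercept) pairs into putative developmental types.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(params)
for k in range(3):
    s, i = params[labels == k].mean(axis=0)
    print(f"type {k}: mean slope {s:+.2f}, mean intercept {i:+.2f}")
```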

Models for explaining developmental trajectories

In the previous section, I presented methods for describing developmental trajectories: multilevel, latent growth curve, autoregressive, and person-centered methods. In this section, I present models aimed at explaining developmental trajectories. Because explanations are based on causal effects, but causal effects have been a kind of taboo topic for personality psychology until recently, a general note on causal effects is in order here.

If we apply the concept of causality to developmental trajectories, we have to explain individual, mean, and differential trajectories with causal mechanisms that change individuals (within-person mechanisms) and that make individuals different from one another (between-person mechanisms). If we ignore the latter, we can explain mean trajectories in terms of within-person mechanisms that apply to all persons. If we include the latter, we can explain mean trajectories in terms of the net result of within- and between-person mechanisms. For example, if we are interested in the changes in neuroticism over young adulthood, we observe a mean decrease between ages 18 and 28 and a rather low true stability of neuroticism in this age period, indicating a lot of differential change (e.g., Specht et al., 2011). Thus, the decrease in neuroticism cannot be due solely to general within-person mechanisms; we have to search for mechanisms that lead to differential change in neuroticism, such as engaging versus not engaging in the first stable partnership (Neyer, Mund, Zimmermann, & Wrzus, 2014). Engaging in the first partnership is a within-person change that may explain decreased neuroticism through increased security of attachment, but there is large variance in when this happens that can be explained only by between-person mechanisms.

In studies of development, the four most often used research designs with accompanying statistical models are statistically controlled long-term prediction, intervention studies, natural experiments, and crosslagged analyses, including longitudinal mediation. Because causes require time to show effects, all designs are longitudinal.

Statistically controlled long-term prediction

In this often-used research design, a predictor describing between-person variation is used for predicting subsequent long-term development, which is described with one of the earlier presented models. Thus, a moderator of the individual trajectories, typically assessed in the first wave of the study, is added to these models. Thereby the models move from describing development to predicting development. However, they are far from explaining development because the prediction may be due to an unmeasured variable that predicts both the predictor and developmental change. For example, Chassin, Curran, Hussong, and Colder (1996) used a latent growth curve model to study adolescents' substance use across three waves. Alcoholism of the biological father predicted both a higher level of and an increase in substance use. The effects remained significant after adding additional predictors, namely, alcoholism of the mother, gender, age, and parental antisocial and affective disorder. After these controls, one can be more certain that the effects were causal effects related to fathers' alcoholism (including genetic similarity with the father), but because the adolescents were not randomly assigned to their biological fathers, other unobserved variables might fully or partly explain the effects.

It is important to note that such control variables should precede the predictor. If they are a consequence of the predictor, one would control for a mediator, which can seriously bias the results (a so-called collider; Asendorpf, 2012; Lee, 2012). For example, substance use of friends in Wave 2 of the study should not be controlled, and controlling for substance use of friends in Wave 1 would be a questionable procedure because the substance use of friends may partly depend on the participants. If no a priori causal assumptions are made and the models involve variables where it is not clear whether they should be considered a predictor or an outcome, researchers can easily be lost in covariation, relying on traditional routines designed for the control of certain predictor variables although the controls might be outcomes in the particular study. In these cases, other methods such as crosslagged analyses are more appropriate (see the later section).
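The danger of controlling for a consequence of the predictor can be demonstrated in a few lines of simulation. In the sketch below (illustrative, not from the chapter), the true effect of the predictor on the outcome is .40, but conditioning on a variable that is itself a consequence of both predictor and outcome distorts the estimate:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

x = rng.normal(size=n)                      # predictor, e.g., parental risk
y = 0.4 * x + rng.normal(size=n)            # outcome; true effect of x is .40
m = 0.5 * x + 0.5 * y + rng.normal(size=n)  # a consequence of both x and y

def slope_of_x(controls, outcome):
    """OLS coefficient of x with optional control variables."""
    X = np.column_stack([np.ones(n), x] + controls)
    return np.linalg.lstsq(X, outcome, rcond=None)[0][1]

print("effect of x, no control:   ", round(float(slope_of_x([], y)), 3))   # ~ .40
print("effect of x, controlling m:", round(float(slope_of_x([m], y)), 3))  # biased
```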

Intervention studies

In intervention studies, the participants are randomly assigned to an intervention group versus a control group to study effects of the intervention on an outcome in terms of the difference between the two groups (randomized controlled trial, RCT). This design is considered the gold standard in applied areas where interventions are used to change persons in an intended direction (e.g., psychotherapy, coaching, medication) because, due to the randomization, RCT effects cannot be caused by unobserved variables. This is the great advantage of RCTs compared to statistically controlled prediction. In many RCTs, later follow-ups are included to study long-term effects of the intervention. In this case, the dummy variable for the groups (1 for intervention or treatment, 0 for control) can be used as a predictor of the initial level (the effect at the end of the intervention) and of the individual slopes of the outcome (change after the intervention).

If the aim of the intervention is to change a trait or an environmental characteristic of the participants in the treatment group, intervention effects seem to be similar to between-person differences in the trait or environment. However, they should not be confused with them, because treatment effects in an RCT are within-person changes. This becomes obvious when the outcome variable is measured before the intervention. Because the assignment is random, the intervention group and the control group are expected to score equally on the outcome variable before the intervention but differently after the intervention if it shows an effect; thus, the causal effects captured by RCTs are within-person effects even though the design is a between-person one. Therefore, explaining effects of observed, nonmanipulated between-person differences with effects observed in RCTs is prone to misinterpretation. For example, drinking alcohol decreases one's IQ score compared to being sober during the test (Tzambazis & Stough, 2000). If this within-person effect were generalized to between-person differences, one would expect higher regular alcohol consumption to be negatively correlated with IQ. However, the opposite was found. In a nationally representative study, higher IQ predicted an increased likelihood of having tried alcohol and other recreational drugs even after controlling for education and income (Wilmoth, 2012), and a higher childhood IQ predicted increased alcohol consumption in later life (Batty et al., 2008). Between-person effects are identical to within-person effects only under the rare condition of ergodicity (see Molenaar & Campbell, 2009).

If a significant part of a between-person effect is due to genetic differences, this part cannot be explained with RCTs because they do not change the genome. Today, RCTs can only explain environmental effects. Because of this problem, RCTs cannot be considered the gold standard for explaining personality development; instead, their results are prone to misinterpretation unless one can be sure that a good deal of the between-person differences were caused by the within-person mechanism(s) studied with the RCT. In addition, it is difficult to find adults who would accept having their nonpathological traits or nonrisk environments changed for the purpose of an experiment (but see Hudson & Fraley, 2015, for an intervention study targeting participants who wished to change some of their traits in the normal range, and Roberts et al., 2017, for a meta-analysis of the effects of mainly clinical interventions on personality change).
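The alcohol example can be mimicked with simulated data. The sketch below (illustrative parameter values, not taken from the cited studies) builds in a negative within-person effect of acute alcohol on a test score while letting habitual drinking correlate positively with ability between persons, so the two levels of analysis yield opposite conclusions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

iq = rng.normal(100.0, 15.0, n)
# Between persons: slightly more alcohol use among higher-IQ persons.
drinking = 0.3 * (iq - 100.0) / 15.0 + rng.normal(0.0, 1.0, n)

# Within persons: the same person scores worse when tested after drinking.
score_sober = iq + rng.normal(0.0, 5.0, n)
score_drunk = iq - 6.0 + rng.normal(0.0, 5.0, n)  # acute impairment of ~6 points

print("mean within-person change (drunk - sober):",
      round(float((score_drunk - score_sober).mean()), 1))        # about -6
print("between-person r(drinking, sober score):",
      round(float(np.corrcoef(drinking, score_sober)[0, 1]), 2))  # positive
```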


Natural experiments

Natural experiments (also called quasiexperiments; Shadish, Cook, & Campbell, 2002) are similar to RCTs except that the assignment to the intervention and the control group is not random. Instead, it is only assumed that the assignment to the two compared groups is initially statistically independent of the outcome variable. The assumed initial independence sets natural experiments apart from statistically controlled prediction, and the lack of guaranteed randomness sets them apart from RCTs; in terms of strictness of control, they are thus intermediate between statistically controlled prediction and RCTs. The study by Hudson and Fraley (2015) is a quasiexperiment because the participants selected themselves into the intervention conditions. Many genetically sensitive designs such as twin or adoption studies can be considered natural experiments (see Rutter, 2007, for an overview). To give only one example, in a monozygotic control-twin design, genetic effects are controlled by comparing monozygotic twins with one another; because their genome is identical, the observed differences between them are only environmental (see, e.g., Caspi et al., 2004).

Natural experiments are often used in research on risk and resilience, where the between-group variation is a naturally occurring environmental risk or protective experience that cannot be experimentally induced for practical or ethical reasons. For example, Neyer and Asendorpf (2001) compared two groups in emerging adulthood who had not engaged in a first stable partnership; their initial neuroticism scores were virtually identical and higher than those of peers in a stable partnership. Four years later, the neuroticism scores of the singles who continued to be singles remained high, whereas the neuroticism scores of those who had meanwhile engaged in a stable partnership decreased to the level initially shown by the participants in a stable partnership. This effect could be replicated for the remaining singles in the subsequent retest interval and in another longitudinal study with a different sample (Neyer et al., 2014). The effect suggests that engaging in a first stable partnership explains to some extent both the decreasing level and the differential change in neuroticism over young adulthood (Specht et al., 2011).
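The logic of the monozygotic control-twin design can also be sketched in a brief simulation (variable names and effect sizes are illustrative): because the genetic and shared family component is identical within a pair, within-pair difference scores cancel it out and isolate the environmental effect:

```python
import numpy as np

rng = np.random.default_rng(5)
n_pairs = 5_000

# Genes and shared environment: identical within a monozygotic pair.
family = rng.normal(size=n_pairs)
# Twin-specific (nonshared) experience with a true effect of .30 on the outcome.
exp1, exp2 = rng.normal(size=n_pairs), rng.normal(size=n_pairs)
out1 = family + 0.3 * exp1 + rng.normal(0.0, 0.5, n_pairs)
out2 = family + 0.3 * exp2 + rng.normal(0.0, 0.5, n_pairs)

# Within-pair differences cancel the shared component and isolate the
# environmental effect.
d_exp, d_out = exp1 - exp2, out1 - out2
beta = float(d_exp @ d_out) / float(d_exp @ d_exp)
print("environmental effect estimated from MZ differences:", round(beta, 2))
```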

Crosslagged analyses

Crosslagged analysis couples the autoregressive models of two variables X and Y with regressions between subsequent waves (crosslagged regressions; see Fig. 4 for the case of three time points). The key idea is that causal effects require some time to unfold, such that a time lag can be used to disentangle the direction of causality between X and Y to some extent (e.g., Gollob & Reichardt, 1987; Little, 2013).

The logic underlying crosslagged effects can be illustrated with the following thought experiment. Assume that the first variable at the first assessment, X1, has no causal effect on the other variable at the second assessment, Y2, and that Y2 is not causally influenced by unobserved variables. In that case, X1 and Y2 are linked by two possible paths (see Fig. 4). One is the indirect path r1·d1 from X1 to Y2 through Y1, and the other is the direct path c1 from X1 to Y2. The direct path is zero in this case because X1 has no causal effect on Y2 and no unobserved variables affect this path. If the initial correlation r1 and the autoregression d1 of Y2 on Y1 are both nonzero, the indirect path is nonzero. Thus, X1 can show a spurious effect on Y2 based on the indirect path. Now assume that X1 has a causal effect on Y2 that can be captured by a linear regression. In this case, the direct effect is nonzero and independent of the effect based on the indirect path r1·d1. This direct effect is the crosslagged effect of X1 on Y2; it captures the causal effect of X1 on Y2 and controls for the indirect path. Similarly, a causal effect of Y1 on X2 is captured by the crosslagged effect b1 of Y1 on X2.
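This thought experiment is easy to verify numerically. The sketch below (illustrative; no causal path from X1 to Y2 is built into the data) shows a sizable zero-order correlation between X1 and Y2 arising purely from the indirect path, while the crosslagged regression coefficient controlling for Y1 is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 100_000

# Wave 1: X1 and Y1 correlate (r1 > 0), but X1 has no causal effect on Y2.
common = rng.normal(size=n)
x1 = common + rng.normal(size=n)
y1 = common + rng.normal(size=n)
y2 = 0.6 * y1 + rng.normal(size=n)  # only the autoregression d1 is at work

# The zero-order correlation of X1 with Y2 is spurious (indirect path r1*d1):
print("r(x1, y2):", round(float(np.corrcoef(x1, y2)[0, 1]), 3))

# The crosslagged regression of Y2 on X1, controlling Y1, is essentially zero:
X = np.column_stack([np.ones(n), y1, x1])
b, *_ = np.linalg.lstsq(X, y2, rcond=None)
print("crosslagged coefficient c1 of x1 (given y1):", round(float(b[2]), 3))
```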

FIG. 4 A three-wave crosslagged model.

Crosslagged analysis reverses the logic of the thought experiment. If we find a crosslagged effect of X1 on Y2, we often conclude that a causal effect has occurred because we know that the effect cannot be due to the initial correlation and autoregression, and other knowledge about the data is not available. However, because the variation in X1 was not experimentally induced, the crosslagged effect may be due to an unobserved variable influencing both X1 and Y2. Therefore, the effect should be considered a quasicausal effect rather than a causal effect (Shadish et al., 2002). At time t = 2, the concurrent zero-order correlation between X2 and Y2 consists of two parts. One part is the correlation induced by the preceding paths in the model. The other part is the correlation r2 between the residuals of X2 and Y2. This latter correlation captures external influences on the preceding synchronic change of X and Y (i.e., effects of unobserved variables that led to the same changes in X and Y). For example, if X is parenting style and Y is a trait of the child and both are rated by the parent, the current emotional state of the parent might bias the assessments of X2 and Y2. The same logic underlies the crosslagged effects b2 and c2 from t = 2 to t = 3; they are controlled for the correlation between X2 and Y2 (not only for r2) and for the autoregressions a2 and d2.

In the case of long-term development, three typical applications of crosslagged analysis can be distinguished. First, many studies are concerned with the transaction between enduring environmental characteristics and personality traits, such as consumed media content and aggressiveness (see the review by Krahé, 2014), quality of social relationships and personality (see the review by Neyer et al., 2014), or parenting and child temperament (see the review by Kiff, Lengua, & Zalewski, 2011). The main question driving these studies is whether there is evidence for effects of between-person differences in the enduring environment on personality differences (socialization effects) and/or vice versa (environment selection effects due to personality differences). Second, other studies focus on the transaction between traits, such as trait self-esteem and nonclinical depression (Orth & Robins, 2013), or the Big Five factors of personality and nonclinical depression and aggression (Klimstra, Akse, Hale III, Raaijmakers, & Meeus, 2010). The main question driving these studies is whether there is evidence that certain traits are risk factors for negatively valued traits, protect from developing or buffer negatively valued traits, or promote positively valued traits, sometimes linking more than two traits (see, e.g., the developmental cascade model linking academic achievement with internalizing and externalizing tendencies by Masten et al., 2005). All these studies focus on effects of the level of a predictor on an outcome (level effects). Only recently has another causal question been studied using crosslagged effects, namely, the effects of change in a predictor on subsequent change in an outcome (change-to-change effects; Grimm, An, McArdle, Zonderman, & Resnick, 2012). Whereas the original study predicted declines in memory performance in the elderly from preceding increases in lateral ventricle size in the brain, Mund and Neyer (2014) studied change-to-change effects in transactions between the Big Five personality traits and various measures of social relationship quality from early to midadulthood. These studies use an equivalence between crosslagged models and bivariate latent change score models.

The methodological considerations concerning the control of measurement error, factorial invariance, and stationarity discussed in the section on autoregressive models apply to crosslagged models as well because they are based on two autoregressive models. In the case of crosslagged models, stationarity also includes the stationarity of the crosslagged regressions. Another specific point to consider concerns the effect sizes of the crosslagged regressions. Because crosslagged regressions are controlled for the product of the initial correlation and the autoregression of the outcome, which is often of medium size (e.g., .40 × .75 = .30 in r units), the effect sizes for crosslagged effects are often small if one judges them according to Cohen's (1988) classification of effect sizes for zero-order correlations (.10 small, .30 medium, .50 large). In fact, it seems misleading to apply Cohen's guidelines directly to crosslagged effects (Adachi & Willoughby, 2015). A reasonable approach is to divide the standardized crosslagged regression by 1 − rd, where r is the initial correlation and d is the standardized autoregression of the outcome, and to apply Cohen's rules to this adjusted coefficient. Consider a case in which two variables X and Y are concurrently correlated at .40 at t1 and the autoregression of the outcome variable is .75. In this case, rd = .30, so the effect size benchmarks are scaled by 1 − .30 = .70. Thus, a crosslagged effect of .07 would correspond to a small effect size (.10 × .70), .21 would be medium-sized (.30 × .70), and .35 would be large (.50 × .70).
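A minimal helper for this adjustment, using the worked numbers from the text (the function name is mine, not from Adachi & Willoughby, 2015):

```python
def adjusted_crosslagged_effect(beta, r, d):
    """Rescale a standardized crosslagged regression (beta) by 1 - r*d,
    where r is the initial correlation and d is the standardized
    autoregression of the outcome, so Cohen's benchmarks can be applied."""
    return beta / (1.0 - r * d)

# Worked example from the text: r = .40, d = .75, so 1 - r*d = .70.
for beta in (0.07, 0.21, 0.35):
    print(f"crosslagged {beta:.2f} -> adjusted "
          f"{adjusted_crosslagged_effect(beta, 0.40, 0.75):.2f}")
# Prints 0.10, 0.30, 0.50: Cohen's small, medium, and large benchmarks.
```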

More recently, some authors have proposed to control for the person-means across time in crosslagged models (e.g., Bainter & Howard, 2016; Berry & Willoughby, 2017; Hamaker, Kuiper, & Grasman, 2015). This is similar to using ipsatized scores or person-mean centered scores (see the section on multilevel analysis). Whereas person-mean centering is useful for intensive longitudinal studies (see Chapter 32 of this handbook), where the person-mean has a clear psychological interpretation (it is a trait), the interpretation of the person-mean in long-term studies is far from clear because differential change is not an oscillation around a constant, as in short-term studies of states, but a continuous drift away from the initial between-person differences (see the section on differential stability). More importantly, controlling for the person-means implies that a large part of the between-person variance is lost for the crosslagged effects, such that these effects are severely underestimated.

For an illustration, Brummelman et al. (2015) found in a four-wave study effects of parental overvaluation on children's narcissism for both mother and father overvaluation but not vice versa. Not surprisingly, overvaluation was related to parental narcissism (Brummelman, Thomaes, Nelemans, Orobio de Castro, & Bushman, 2015). Parental narcissism and other parental traits led to relatively stable between-person differences in overvaluation. A reasonable assumption is that it is not the within-parent variation in overvaluation over the 18 months of the study that made children more or less narcissistic but the chronicity of overvaluation, and this chronicity is captured only if the full range of between-person differences in chronic overvaluation is taken into account. Because the chronicity of overvaluation would be the critical causal factor, it should be part of the variation on which the crosslagged effect is based. Now let us assume that Brummelman, Thomaes, Nelemans, Orobio de Castro, et al. (2015) had person-centered parental overvaluation, studying crosslagged effects between within-parent overvaluation and children's narcissism. In that case, all constant between-person differences in overvaluation would be lost in the crosslagged effects, including all constant differences in parental narcissism; what would be left are the within-person variations in overvaluation between the waves of the study. Therefore, person-mean centering or similar approaches should not be used in long-term studies of personality development.

Longitudinal mediation

Longitudinal mediation models are "double" crosslagged models (see Fig. 5 and Reitz, Motti-Stefanidi, & Asendorpf, 2016, for an example study). X at Time 1 affects the mediator M at Time 2, which, in turn, affects the outcome Y at Time 3, and vice versa. In the model shown in Fig. 5, the longitudinal mediation effect from X to Y is tested with the product c1·f2; c′ is the remaining direct lag2 effect from X to Y. The reverse effect is tested with e1·b2; the remaining lag2 effect is c″. As longitudinal mediation models are extensively discussed by Cole and Maxwell (2003), and most basic features were already discussed in the section on crosslagged analysis, I focus here only on a few specific points about these models.

First, longitudinal mediation models are quasicausal models, whereas cross-sectional mediation models are not even quasicausal models because they are silent about the direction of causality, just as any concurrent correlation. Cross-sectional mediation is currently overestimated as a method to detect causal mechanisms (see Maxwell & Cole, 2007, for a comparison of cross-sectional and longitudinal mediation). Longitudinal mediation is clearly superior because it can disentangle the direction of effects. Second, whereas mediation from X to Y can be tested ignoring X3 and Y1, and mediation from Y to X can be tested ignoring X1 and Y3, it is useful to include all nine assessments as shown in Fig. 5 because both directions of mediation can be tested simultaneously and the sizes of the mediation effects can be compared directly. Also, if stationarity of the model is confirmed (i.e., all auto- and crosslagged effects can be fixed to be equal across time), twice as much information is used for the estimation of each mediation effect, which results in more robust estimates. Third, longitudinal mediation effects are most often very small compared to zero-order regression effects because they are products of small effects to begin with. If one applies the correction procedure proposed earlier for crosslagged regressions, and if the product of the initial correlation and the autoregression of the dependent variable is .30 for both the predictor-to-mediator part and the mediator-to-outcome part, a standardized longitudinal mediation effect in terms of the product of the two standardized components is small if it is above .005 (.07 × .07 = .0049), medium-sized if it is above .045, and large if it is above .123. Indeed, my experience with longitudinal mediation is that effect sizes above .10 are rare. Because longitudinal mediation effects are most often very small, very large samples are required to detect them with sufficient statistical power.
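The arithmetic behind these benchmarks can be packaged in a few lines (the function name and defaults are mine; the .30 values follow the example in the text):

```python
def mediation_benchmarks(rd_xm=0.30, rd_my=0.30):
    """Effect-size benchmarks for a longitudinal mediation effect (the
    product of two crosslagged paths), applying the 1 - r*d adjustment to
    each leg; the .30 defaults follow the example in the text."""
    scale = (1.0 - rd_xm) * (1.0 - rd_my)  # .70 * .70 = .49
    return {label: round(c * c * scale, 4)
            for label, c in (("small", 0.10), ("medium", 0.30), ("large", 0.50))}

print(mediation_benchmarks())
# {'small': 0.0049, 'medium': 0.0441, 'large': 0.1225}
```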

FIG. 5 A three-wave longitudinal mediation model (correlations between the residuals are not shown).

Conclusion

Studies of long-term personality development have to carefully distinguish between individual, mean, and differential change (see Figs. 1 and 2). Differential development in terms of between-person differences in individual trajectories can be studied from a variable-centered perspective with powerful statistical models that have specific advantages and disadvantages (see Table 1). From a person-centered perspective, personality profiles in traits or environmental characteristics can be studied over time, either in terms of the stability of personality types or in terms of types of individual trajectories.

The different individual trajectories can be explained with methods of varying strictness of control. The most powerful control is achieved with randomized controlled trials, which are, however, difficult to apply broadly for explaining normal personality variation. Better suited are natural experiments, in which two groups, defined by their genetic relatedness or by a difference in their environment, are initially statistically equivalent on a between-person variable that later shows a group difference. In both cases, the group difference is studied with a between-person design although the effects are based on within-person mechanisms. Therefore, they can explain between-person changes only to the extent that these are due to the within-person mechanisms. Statistically controlled prediction is the weakest design in terms of controlling unobserved variables but the strongest in terms of the range of applications. Care must be exercised not to control for consequences of the predictor (colliders). Crosslagged analysis and longitudinal mediation analysis based on a doubled crosslagged design can be considered special cases of statistically controlled prediction because they control for a preceding assessment of the dependent variable (or a preceding change in it, in the case of change-to-change effect models). They allow for disentangling the effects of two between-person variables on each other over development. Whereas it can be useful to control for the person-means in studies of short-term change or within-person fluctuations with person-mean centering or similar procedures, such a control is highly problematic in the case of long-term change because it may take out an important part of the between-person variation on which the causal effects are based.

More generally, the quality of the explanation of between-person differences in individual developmental trajectories depends not only on the design for predicting these differences and the specific statistical model used for data analysis but also on the adequate timing of the assessments and, last but not least, on the plausibility of the assumed causal mechanisms underlying the assumed effects. Fishing for causal effects without a priori investment of thought into possible causal mechanisms is a dangerous endeavor that will often result in uninterpretable or nonreplicable findings.

References

Adachi, P., & Willoughby, T. (2015). Interpreting effect sizes when controlling for stability effects in longitudinal autoregressive models: Implications for psychological science. European Journal of Developmental Psychology, 12, 116–128.
Asendorpf, J. B. (2012). Bias due to controlling a collider: A potentially important issue for personality research (comment). European Journal of Personality, 26, 391–392.
Asendorpf, J. B. (2015). Person-centered approaches to personality. In M. Mikulincer, P. H. Shaver, M. L. Cooper, & R. J. Larsen (Eds.), Handbook of personality and social psychology: Vol. 4. Personality processes and individual differences (pp. 403–424). Washington, DC: American Psychological Association.
Asendorpf, J. B., Borkenau, P., Ostendorf, F., & van Aken, M. A. G. (2001). Carving personality description at its joints: Confirmation of three replicable personality prototypes for both children and adults. European Journal of Personality, 15, 169–198.
Asendorpf, J. B., & Denissen, J. J. A. (2006). Predictive validity of personality types versus personality dimensions from early childhood to adulthood: Implications for the distinction between core and surface traits. Merrill-Palmer Quarterly, 52, 486–513.
Asendorpf, J. B., & van Aken, M. A. G. (1999). Resilient, overcontrolled and undercontrolled personality prototypes in childhood: Replicability, predictive power, and the trait/type issue. Journal of Personality and Social Psychology, 77, 815–832.
Asendorpf, J. B., van de Schoot, R., Denissen, J. J. A., & Hutteman, R. (2014). Reducing bias due to systematic attrition in longitudinal studies: The benefits of multiple imputation. International Journal of Behavioral Development, 38, 453–460.
Bainter, S. A., & Howard, A. L. (2016). Comparing within-person effects from multivariate longitudinal models. Developmental Psychology, 52, 1955–1968.
Batty, G. D., Deary, I. J., Schoon, I., Emslie, C., Hunt, K., & Gale, C. R. (2008). Childhood mental ability and adult alcohol intake and alcohol problems: The 1970 British Cohort Study. American Journal of Public Health, 98, 2237–2243.
Berry, D., & Willoughby, M. T. (2017). On the practical interpretability of cross-lagged panel models: Rethinking a developmental workhorse. Child Development, 88, 1186–1206.
Block, J. (1961). The Q-sort method in personality assessment and psychiatric research. Springfield, IL: Charles C Thomas.
Block, J. (1971). Lives through time. Berkeley, CA: Bancroft Books.
Block, J. (2008). The Q-Sort in character appraisal: Encoding subjective impressions of persons quantitatively. Washington, DC: American Psychological Association.
Block, J. H., & Block, J. (1980). The role of ego-control and ego-resiliency in the organization of behavior. In W. A. Collins (Ed.), Minnesota symposium on child psychology: Vol. 13 (pp. 39–101). Hillsdale, NJ: Erlbaum.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Hoboken, NJ: Wiley.
Brummelman, E., Thomaes, S., Nelemans, S. A., Orobio de Castro, B., Overbeek, G., & Bushman, B. J. (2015). Origins of narcissism in children. Proceedings of the National Academy of Sciences of the United States of America, 112, 3659–3662.
Brummelman, E., Thomaes, S., Nelemans, S. A., Orobio de Castro, B., & Bushman, B. J. (2015). My child is God's gift to humanity: Development and validation of the Parental Overvaluation Scale (POS). Journal of Personality and Social Psychology, 108, 665–679.
Caspi, A., Moffitt, T. E., Morgan, J., Rutter, M., Taylor, A., Arseneault, L., … Polo-Thomas, M. (2004). Maternal expressed emotion predicts children's antisocial behavior problems: Using monozygotic-twin differences to identify environmental effects on behavioral development. Developmental Psychology, 40, 149–161.
Chapman, B. P., & Goldberg, L. R. (2011). Replicability and 40-year predictive power of childhood ARC types. Journal of Personality and Social Psychology, 101, 593–606.
Chassin, L., Curran, P. J., Hussong, A. M., & Colder, C. R. (1996). The relation of parent alcoholism to adolescent substance use: A longitudinal follow-up study. Journal of Abnormal Psychology, 105, 70–80.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cole, D. A., & Maxwell, S. E. (2003). Testing mediational models with longitudinal data: Questions and tips in the use of structural equation modeling. Journal of Abnormal Psychology, 112, 558–577.
Denissen, J. J. A., Asendorpf, J. B., & van Aken, M. A. G. (2008). Childhood personality predicts long-term trajectories of shyness and aggressiveness in the context of demographic transitions in emerging adulthood. Journal of Personality, 76, 67–100.
Ferguson, C. J. (2010). A meta-analysis of normal and disordered personality across the life span. Journal of Personality and Social Psychology, 98, 659–667.
Geiser, C., & Lockhart, G. (2012). A comparison of four approaches to account for method effects in latent state-trait analyses. Psychological Methods, 17, 255–283.
Gollob, H. F., & Reichardt, C. S. (1987). Taking account of time lags in causal models. Child Development, 58, 80–92.
Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576.
Grimm, K. J., An, Y., McArdle, J. J., Zonderman, A. B., & Resnick, S. M. (2012). Recent changes leading to subsequent changes: Extensions of multivariate latent difference score models. Structural Equation Modeling, 19, 268–292.
Hamaker, E. L., Kuiper, R. M., & Grasman, R. P. P. P. (2015). A critique of the cross-lagged panel model. Psychological Methods, 20, 102–116.
Hayduk, L. A. (1987). Structural equation models with LISREL: Essentials and advances. Baltimore, MD: The Johns Hopkins University Press.
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Hudson, N. W., & Fraley, R. C. (2015). Volitional personality trait change: Can people choose to change their personality traits? Journal of Personality and Social Psychology, 109, 490–507.
Jung, T., & Wickrama, K. A. S. (2008). An introduction to latent class growth analysis and growth mixture modeling. Social and Personality Psychology Compass, 2, 302–317.
Kiff, C. J., Lengua, L. J., & Zalewski, M. (2011). Nature and nurturing: Parenting in the context of child temperament. Clinical Child and Family Psychology Review, 14, 251–301.
Klimstra, T. A., Akse, J., Hale, W. W., III, Raaijmakers, Q. A. W., & Meeus, W. H. (2010). Longitudinal associations between personality traits and problem behavior symptoms in adolescence. Journal of Research in Personality, 44, 273–284.
Kluckhohn, C., & Murray, H. A. (1961). Personality formation: The determinants. In C. Kluckhohn, H. A. Murray, & D. M. Schneider (Eds.), Personality in nature, society and culture (2nd ed., pp. 53–67). New York, NY: Knopf.
Krahé, B. (2014). Media violence use as a risk factor for aggressive behaviour in adolescence. European Review of Social Psychology, 25, 71–106.
Lee, J. J. (2012). Correlation and causation in the study of personality. European Journal of Personality, 26, 372–390.
Little, T. D. (2013). Longitudinal structural equation modeling. New York, NY: Guilford Press.
Magnusson, D. (1988). Individual development from an interactional perspective: A longitudinal study. Hillsdale, NJ: Erlbaum.
Masten, A. S., Roisman, G. I., Long, J. D., Burt, K. B., Obradovic, J., Riley, J. R., … Tellegen, A. (2005). Developmental cascades: Linking academic achievement and externalizing and internalizing symptoms over 20 years. Developmental Psychology, 41, 733–746.
Maxwell, S. E., & Cole, D. A. (2007). Bias in cross-sectional analyses of longitudinal mediation. Psychological Methods, 12, 23–44.
Meeus, W., Van de Schoot, R., Klimstra, T., & Branje, S. (2011). Personality types in adolescence: Change and stability and links with adjustment and relationships: A five-wave longitudinal study. Developmental Psychology, 47, 1181–1195.
Molenaar, P. C. M., & Campbell, C. G. (2009). The new person-specific paradigm in psychology. Current Directions in Psychological Science, 18, 112–117.
Mund, M., & Neyer, F. J. (2014). Treating personality-relationship transactions with respect: Narrow facets, advanced models, and extended time frames. Journal of Personality and Social Psychology, 107, 352–368.
Neyer, F. J., & Asendorpf, J. B. (2001). Personality-relationship transaction in young adulthood. Journal of Personality and Social Psychology, 81, 1190–1204.
Neyer, F. J., Mund, M., Zimmermann, J., & Wrzus, C. (2014). Personality-relationship transactions revisited. Journal of Personality, 82, 539–550.
Orth, U., & Robins, R. W. (2013). Understanding the link between low self-esteem and depression. Current Directions in Psychological Science, 22, 455–460.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models (2nd ed.). Thousand Oaks, CA: Sage.
Reitz, A. K., Motti-Stefanidi, F., & Asendorpf, J. B. (2016). Me, us, and them: Testing sociometer theory in a socially diverse real-life context. Journal of Personality and Social Psychology, 110, 908–920.
Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.
Roberts, B. W., Luo, J., Briley, D. A., Chow, P. I., Su, R., & Hill, P. L. (2017). A systematic review of personality trait change through intervention. Psychological Bulletin, 143, 117–141.
Rutter, M. (2007). Proceeding from observed correlation to causal inference: The use of natural experiments. Perspectives on Psychological Science, 2, 377–395.
Schaeffer, C. M., Petras, H., Ialongo, N., Poduska, J., & Kellam, S. (2003). Modeling growth in boys' aggressive behavior across elementary school: Links to later criminal involvement, conduct disorder, and antisocial personality disorder. Developmental Psychology, 39, 1020–1035.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.
Simonton, D. K. (1990). Psychology, science, and history: An introduction to historiometry. New Haven, CT: Yale University Press.
Simonton, D. K. (1998). Mad King George: The impact of personal and political stress on mental and physical health. Journal of Personality, 66, 443–466.
Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. Oxford, UK: Oxford University Press.
Specht, J., Egloff, B., & Schmukle, S. C. (2011). Stability and change of personality across the life course: The impact of age and major life events on mean-level and rank-order stability of the Big Five. Journal of Personality and Social Psychology, 101, 862–882.
Specht, J., Luhmann, M., & Geiser, C. (2014). On the consistency of personality types across adulthood: Latent profile analyses in two large-scale panel studies. Journal of Personality and Social Psychology, 107, 540–556.
Twenge, J. M., Konrath, S., Foster, J. D., Campbell, W. K., & Bushman, B. J. (2008). Egos inflating over time: A cross-temporal meta-analysis of the Narcissistic Personality Inventory. Journal of Personality, 76, 875–902.
Tzambazis, K., & Stough, C. (2000). Alcohol impairs speed of information processing and simple and choice reaction time and differentially impairs higher-order cognitive abilities. Alcohol and Alcoholism, 35, 197–201.
Voelkle, M. C., Oud, J. H., Davidov, E., & Schmidt, P. (2012). An SEM approach to continuous time modeling of panel data: Relating authoritarianism and anomia. Psychological Methods, 17, 176–192.
Wilmoth, D. R. (2012). Intelligence and past use of recreational drugs. Intelligence, 40, 15–22.
Wortman, J., Lucas, R. E., & Donnellan, M. B. (2012). Stability and change in the Big Five personality domains: Evidence from a longitudinal study of Australians. Psychology and Aging, 27, 867–874.


C H A P T E R

32 What if apples become oranges? A primer on measurement invariance in repeated measures research

D. Angus Clark (University of Michigan, Ann Arbor, MI, United States) and M. Brent Donnellan (Michigan State University, East Lansing, MI, United States)

O U T L I N E

Introduction 838
What is measurement invariance? 838
A method for testing invariance 844
Advanced techniques for testing invariance 847
The practical impact of noninvariance 848
When invariance fails 850
The future of invariance testing 851
Conclusion 852
Acknowledgments 852
References 853

Abstract Measurement invariance (MI) is a critical issue whenever scores are compared in research. MI is especially relevant when analyzing repeated measures data— regardless of whether there are 2 measurement occasions or 200. In such instances, evaluating MI amounts to testing whether scores at one time point are psychometrically equivalent to scores at a different time point. This chapter provides a primer on MI for repeated measures studies using a simple 2-wave illustration. The objective is to provide a foundation for considering MI when working with repeated measures data of the sort collected by personality

The Handbook of Personality Dynamics and Processes https://doi.org/10.1016/B978-0-12-813995-0.00032-7

Keywords Measurement invariance, Differential item functioning, Psychometrics, Repeated measures, Effect size

837

# 2021 Elsevier Inc. All rights reserved.

838

32. Primer on measurement invariance

Introduction Psychological researchers are often interested in abstract concepts such as anxiety, anger, or extraversion. These constructs usually lack a single “gold standard” of measurement, so researchers use a variety of approaches to quantify these variables. For example, asking people to self-report their thoughts, feelings, or behaviors using survey items is an especially common method for obtaining data (e.g., Clark & Watson, 2019; Lucas & Baird, 2006; Simms, 2008). However, researchers are generally aware that any single method of assessment has limitations. Thus, researchers typically understand that observed measures are an imperfect reflection of psychological variables. Likewise, researchers are generally aware that psychometric properties may differ across time or across groups, at least at an abstract level. The worry that psychometric properties may change over the course of a study is easy to see in developmental research where the same questions may not work in the same way when comparing responses across different periods of the lifespan. For example, the degree that an item asking about “fears of the dark” indexes pathological anxiety may change when comparing reports from children versus adults. That is, the same objective response of “3” to such a survey item implies a different level of the underlying latent trait of anxiety for six-year olds compared to twenty-six year olds. A similar concern about measurement might happen when researchers repeatedly ask questions about anxiety in daily surveys used to study personality processes or personality dynamics. Responding to the same set of questions in repeated succession may impact response processes. Accordingly, a score of 3.5 on the first occasion may not refer to the same level of that latent variable as a score of 3.5 on the fifth occasion. The possibility that measures may function differently across time or in different groups is the crux of the concept of “measurement

invariance” (MI; Millsap & Olivera-Aguilar, 2012). Entire literatures are devoted to this topic, so this chapter is intended as a primer. The goal of this chapter is to introduce the MI literature to researchers who are primarily interested in studying personality dynamics using intensive longitudinal designs. We use a simple motivating example of a two-occasion design to illustrate basic concepts and applications. This approach should provide researchers with a foundation for considering these issues in more technically demanding situations with many measurement occasions.

What is measurement invariance? A preliminary note on terminology is warranted. We refer to measurement invariance (MI) in this chapter and a near-analogous concept that will often be encountered in the psychometric literature is “differential item functioning” (DIF; Osterlind & Everson, 2009). MI and DIF refer to the same underlying issue—although worded in opposite directions—and the existence of two distinct literatures and terminologies is somewhat of a historical accident (Bauer, 2017; de Ayala, 2009). In the structural equational modeling (SEM) tradition, psychometric equivalence has historically been discussed under the label MI, whereas in the item response theory (IRT) tradition, the label DIF is used. Both traditions are focused on the psychometric equivalence (or lack thereof ) of scores across different points of comparison (e.g., time points or person groups). This chapter is written from the perspective of an MI/SEM framework, and therefore terminology from that literature is used throughout (e.g., weak and strong invariance versus uniform and nonuniform DIF). Still, tools and concepts from the DIF literature are cited when relevant. Relatedly, we use the somewhat cumbersome phrase measurement noninvariance (MNI) to refer to a situation where

III. Methods and statistics

What is measurement invariance?

measurement invariance is not evident. The term DIF also generally captures this phenomenon (Bauer, 2017), but the reference to item functioning may be too narrow for present purposes given that we focus on the broader idea of having multiple indicators of a latent variables. Test and survey items are but one kind of indicator. The term MNI captures nonequivalence in how indicators such as items, composite scales, or parcels function across groups or time (e.g., Clark, Durbin, Donnellan, & Neppl, 2017), not just items per se. In the broadest sense, MI refers to psychometric equivalence across populations or measurement occasions (Millsap & OliveraAguilar, 2012). Put simply: does a proposed indicator (e.g., an item or scale score) have the same connection to an underlying latent variable (i.e., the unobserved variable of interest) at the first and second (and third, etc.) occasions of a repeated measures study? This question can be formalized a bit more by asking whether the indicator has the same regression weight for indexing the corresponding latent variable (i.e., the same factor loading) at each occasion of a repeated measures study. Questions about psychometric equivalence can be extended even further by asking whether an indicator also has the same intercept at each occasion of a repeated measures study (i.e., is the expected value for that indicator for someone with an average score on the latent variable the same at all occasions?). These particular questions capture specific types of MI. For example, questions about consistency of factor loadings are often referred to as addressing weak or metric invariance, whereas questions about the consistency of intercepts are often referred to as addressing strong or scalar invariance. Related terms are configural invariance, which addresses the equivalence of model form (e.g., number of factors and cross-loadings) across occasions, and strict invariance, which addresses equivalence in the structure of the residuals (i.e., residual variances and

839

covariances) across occasions. All these terms are described more fully here. A number of psychometric considerations are relevant to measurement invariance and these considerations are often grounded in terminology associated with confirmatory factor analysis (CFA; Brown, 2015). We start with an example to ground our discussion. Fig. 1 depicts a basic latent variable model for two waves of repeated measures data where a construct (e.g., Anxiety) is measured with five indicators. Items are used in this example but the concept would apply as well if five separate scales were used as indicators. The latent variables (the circles in Fig. 1) refer to scores on abstract concepts at both time points that are inferred based on responses to actual the indicators (the boxes in Fig. 1). The indicators (i.e., the observed variables or item responses in this example) are associated with a latent variable by a series of regression weights or factor loadings (the lambdas [λs] in Fig. 1). These weights reflect how strongly indicators relate to the latent variable. Higher loadings in such reflective models are usually interpreted to mean that an indicator is a better representation of the construct in question. The fact that factor loadings are allowed to vary across indicators within the same time point means that not all indicators are equally good representations of constructs. This is a reasonable assumption for many personality applications. Fig. 1 depicts a relatively simple example where longitudinal MI can be evaluated across two occasions. Covariances (double-headed arrows) are specified between the two latent factors, and covariances are specified for the residual variances for each indicator over time in Fig. 1. Both of these sets of covariances reflect that associations likely exist across occasions (i.e., autocorrelation). Researchers are typically interested in the stability of the latent variables themselves over time, but the indicator-specific variances may show autocorrelations as well. The existence of indicator-specific autocorrelation reflects the possibility of stability in the


FIG. 1 Two-wave confirmatory factor-analytic model. I, indicator; R, residual; λ, factor loading; τ, intercept; ε, residual variance; σ², factor variance; μ, factor mean. The first subscript denotes the indicator number, the second subscript the occasion number. The triangle is a constant, representing the inclusion of the mean structure.

unique elements of indicators over time. Thus, both sets of covariances are displayed in Fig. 1. Random measurement error is by definition unique to a specific occasion and does not show autocorrelation.

It is possible to write a simple equation for each indicator based on the logic depicted in Fig. 1. Consider the survey item "My heart was racing." Researchers can formalize the degree to which responses to this survey item reflect the latent construct of anxiety. That equation could be written as:

Heart Racing response = τ + λ(Anxiety) + ε    (1)

Here, τ is an intercept (the expected response when the latent Anxiety variable is 0), λ is the factor loading (the expected change in the indicator for a 1-unit change in latent Anxiety), and ε is a residual term (variance in the Heart Racing indicator that is not shared across the other indicators intended to measure Anxiety). Residual terms are represented by the circles labeled "R" in Fig. 1.

Questions about MI generally boil down to whether the three parameters in Eq. (1) (τ, λ, and ε) vary across time (or groups). So, we can take Eq. (1) and make it time-specific:

Heart Racing_{Time 1} = τ_{Time 1} + λ_{Time 1}(Anxiety) + ε_{Time 1}    (2)

Heart Racing_{Time 2} = τ_{Time 2} + λ_{Time 2}(Anxiety) + ε_{Time 2}    (3)

Eq. (2) represents the model for the heart rate indicator at Time 1, and Eq. (3) represents the model for the heart rate indicator at Time 2.
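To make the role of these parameters concrete, the short sketch below (illustrative values, not the chapter's simulation) computes expected item responses from Eq. (1) under equal intercepts but unequal loadings; with unequal loadings, the same observed response maps onto different latent Anxiety levels at the two occasions:

```python
import numpy as np

def expected_response(anxiety, loading, intercept):
    # Eq. (1) without the residual: expected indicator score
    return intercept + loading * anxiety

anxiety = np.linspace(0, 6, 7)
# Equal intercepts but unequal loadings: weak invariance fails.
time1 = expected_response(anxiety, loading=0.8, intercept=0.5)
time2 = expected_response(anxiety, loading=1.1, intercept=0.5)
print(np.round(time1, 1))  # what a given Anxiety level "looks like" at Time 1
print(np.round(time2, 1))  # ... and at Time 2: the lines diverge as Anxiety rises
```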

Tests of MI usually occur in a sequence of analyses that impose increasingly rigid requirements on psychometric properties. The first and most foundational type of measurement invariance is called configural invariance (CI). CI concerns whether the basic form of the


measurement model is constant across time and is required for further evaluations of MI to proceed. CI is implied in Fig. 1 because this figure was specified to show that the same five observed variables serve as indicators of latent Anxiety at both waves. If, for example, preliminary exploratory factor analyses suggested a single factor solution was optimal at Time 1 whereas two factor solution was optimal at Time 2, configural invariance would be rejected. In such a case, the construct itself would appear to be qualitatively different across time, making scores on a composite of the five indicators fundamentally incomparable across time. The set of indicators measure one attribute at Time 1 but two attributes at Time 2. No further tests of measurement invariance for this set of indicators should occur in such a case. Comparing scores across time in this situation would amount to comparing apples and oranges because the nature of the underlying construct is different at Time 1 and Time 2. Assuming configural invariance is present, tests of weak or metric invariance (WI) focus on whether factor loadings are equivalent across time (λTime1 ¼ λTime2 using elements of Eqs. 2, 3). WI concerns whether each indicator has the same connection to the underlying latent variable at both time points. Following tests of WI are tests of strong or scalar invariance (SI), which focus on whether indicator intercepts are equivalent across time (τTime1 ¼ τTime2 using elements of Eqs. 2, 3). SI tests whether the indicators have the same expected values for people with the same level of the latent value at each occasion. Last, tests of strict or residual invariance (RI), which focus on whether residual variances are equivalent across time (εTime1 ¼ εTime2 using elements of Eqs. (2), (3); the equivalency of residual covariances across time can also be tested if such paths are present). The degree to which a particular level of invariance is satisfied has implications for what analyses can confidently be performed using indicator composites such as summed or


The psychometric implications of different levels of invariance and noninvariance are illustrated in Fig. 2, which depicts responses to the heart racing item as a function of the latent Anxiety dimension across two time points. MI visualizations like this are used in other didactic pieces, such as Brown (2015, see his Fig. 7.4 on page 256).

The top left panel of Fig. 2 depicts a totally invariant item. Each 0 to 6 response to the heart racing item (the Y axis) reflects the same level of the latent Anxiety variable at both time points (the X axis); the two lines are superimposed upon each other. A lack of WI (with SI present) is illustrated in the top right panel of Fig. 2. Although responses at the Anxiety 0-point are equivalent across time, the responses at the two time points diverge as Anxiety increases. In other words, a response of 3 refers to a different level of Anxiety at Time 1 and Time 2; a lack of WI thus entails a discordance between responses that grows along the latent dimension. If WI is not present, researchers should not compare means across occasions and should not make comparisons of the correlates of observed scores across time. A lack of SI (with WI present) is illustrated in the bottom left panel of Fig. 2. When Anxiety is at its 0 point, responses to the heart racing indicator are different at the two time points. Even though the slope is equal, responses are still consistently discordant because of the unequal intercepts. If SI is not supported, researchers should not compare means across occasions. If WI is present but SI is not, the covariance (but probably not correlation; Millsap & Olivera-Aguilar, 2012) structure can be compared over time, but caution may still be warranted. The bottom right panel of Fig. 2 depicts a lack of both WI and SI. In this situation, responses to the heart racing indicator are never equivalent at any point along the Anxiety dimension: the same observed responses on the indicator consistently refer to different levels of Anxiety at the two time points. Fig. 2 highlights why MI



FIG. 2 Expected responses to an item on an anxiety questionnaire administered at two time points for different levels of invariance. Levels of Anxiety are plotted on the X axis, and predicted responses to a specific question (Heart Racing) are plotted on the Y axis. The top left panel depicts full invariance, the top right panel depicts the absence of weak invariance and the presence of strong invariance, the bottom left panel depicts the absence of strong invariance and the presence of weak invariance, and the bottom right panel depicts the absence of both weak and strong invariance.

is so critical. Without establishing MI, comparisons based on the observed data, made without any consideration of psychometric parameters like factor loadings and intercepts, may simply reflect artifacts of measurement rather than true differences.

A simulation based on this example further demonstrates the risks (for more complete details, see https://osf.io/crw58/). The model in Fig. 1 was used as the underlying population model with 1000 observations at each time point. For simplicity's sake, factor loadings, intercepts, and residual variances were held constant within a factor. Population parameter values are presented in Table 1. Specifications were such that the latent variables and indicators were on a standard metric, with indicator scores

lower at Time 2 (i.e., equivalent levels of latent Anxiety result in lower responses). Of note, we specified the model such that there was no true mean change across time (i.e., on average, people have the same level of Anxiety at Time 2 as at Time 1), but more variability in Anxiety at Time 2, with a moderate correlation between time points. Such a simulation is revealing because we can examine the extent to which the stipulated aspects of the data (e.g., no mean change across time) are recovered under different conditions of MI.

First, consider a population where only CI holds; factor loadings, intercepts, and residual variances all vary across time. Assuming invariance by estimating a model with loadings, intercepts, and residual variances fixed to equality across time (this is the effective assumption when computing raw composite scores) leads to some highly distorted conclusions (Table 1).


TABLE 1    Consequences of incorrectly assuming invariance.

                         Configural         Weak               Strong             Strict
                         invariance         invariance         invariance         invariance
Parameter                Pop      Est       Pop      Est       Pop      Est       Pop      Est

Age 5 (Time 1)
  λ                      .75      .73       .75      .73       .75      .73       .75      .75
  τ                      .00      .00       .00      .00       .00      .00       .00      .00
  ε²                     .44      .60       .44      .60       .44      .60       .44      .44
  σ²                    1.00     1.00      1.00     1.00      1.00     1.00      1.00     1.00
  μ                      .00      .00       .00      .00       .00      .00       .00      .00

Age 17 (Time 2)
  λ                      .50      .73       .75      .73       .75      .73       .75      .75
  τ                    -1.50      .00     -1.50      .00       .00      .00       .00      .00
  ε²                     .75      .60       .75      .60       .75      .60       .44      .44
  σ²                    2.00     1.00      2.00     2.18      2.00     2.18      2.00     2.00
  μ                      .00    -2.06       .00    -2.06       .00      .00       .00      .00

Covariances
  Factor                 .30      .21       .30      .32       .30      .32       .30      .30
  Residual               .20      .20       .20      .20       .20      .20       .20      .20

Note. λ, factor loading; τ, intercept; ε², residual variance; σ², factor variance; μ, factor mean; Pop, population value; Est, average estimate across replications. All simulations included N = 1000; k = 1000 replications were run for each condition. All factor loadings, intercepts, and residual variances were equivalent within a factor, and were constrained to equality both within and between factors in the estimated models (i.e., the estimated models imposed invariance). For full details, see https://osf.io/crw58/.

The correlation across time is smaller than in reality, there is no evidence for any increase in variability, and, most troubling for the analysis of absolute change over time, there appears to be a mean decline in anxiety from Time 1 to Time 2. Consider then a situation in which WI holds, but SI and RI do not. Here, the correlation between time points is reproduced much more accurately, and the variance increase over time is also captured well. However, there is still evidence of a decrease in Anxiety over time. Next, consider the situation in which both WI and SI

hold, but RI does not. Here, the correlation and variance difference are again reproduced. Additionally, there is now no evidence of a mean decline over time. The presence of RI in addition to WI and SI therefore serves to further reduce some of the remaining bias, for example, in factor loadings, but otherwise does not seem to have much of a practical effect. This simulation is intended as a somewhat dramatic example, and it is unlikely that every indicator in real data is as severely noninvariant as the stipulations used in the foregoing illustration. Nonetheless, these simulations still emphasize the fundamental issue regarding the importance of invariance testing: noninvariance left undetected and unaddressed has the potential to produce severely distorted conclusions.
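The full simulation code is available at the OSF repository linked above; what follows is only a minimal sketch of the same logic, written here with the lavaan package (one common R option, not necessarily what was used for the chapter's simulations), using three indicators per wave for brevity and the Table 1 population values.

library(lavaan)

# Population model: loadings, intercepts, and residual variances all differ
# across waves (only configural invariance holds); no true latent mean change.
pop <- '
  anx1 =~ 0.75*y1a + 0.75*y2a + 0.75*y3a
  anx2 =~ 0.50*y1b + 0.50*y2b + 0.50*y3b
  anx1 ~~ 1*anx1
  anx2 ~~ 2*anx2
  anx1 ~~ 0.3*anx2
  y1b ~ (-1.5)*1
  y2b ~ (-1.5)*1
  y3b ~ (-1.5)*1
  y1a ~~ 0.44*y1a
  y2a ~~ 0.44*y2a
  y3a ~~ 0.44*y3a
  y1b ~~ 0.75*y1b
  y2b ~~ 0.75*y2b
  y3b ~~ 0.75*y3b
'
set.seed(1)
dat <- simulateData(pop, sample.nobs = 1000)

# Analysis model that wrongly imposes full invariance, the effective assumption
# behind raw composite scores: shared labels (L, i, r) force equality across waves.
inv <- '
  anx1 =~ L1*y1a + L2*y2a + L3*y3a
  anx2 =~ L1*y1b + L2*y2b + L3*y3b
  y1a ~ i1*1
  y1b ~ i1*1
  y2a ~ i2*1
  y2b ~ i2*1
  y3a ~ i3*1
  y3b ~ i3*1
  y1a ~~ r1*y1a
  y1b ~~ r1*y1b
  y2a ~~ r2*y2a
  y2b ~~ r2*y2b
  y3a ~~ r3*y3a
  y3b ~~ r3*y3b
  anx1 ~~ 1*anx1   # Time 1 variance fixed to scale the factors
  anx1 ~ 0*1       # Time 1 latent mean fixed to 0
  anx2 ~~ anx2     # Time 2 variance free
  anx2 ~ 1         # Time 2 latent mean free
  anx1 ~~ anx2
'
fit <- sem(inv, data = dat)

# Although the true latent mean change is zero, the estimated Time 2 latent
# mean is clearly negative, mirroring the bias illustrated in Table 1.
subset(parameterEstimates(fit), lhs == "anx2" & op == "~1")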


For a less contrived example, take Wicherts et al.'s (2004) investigation of MI in cognitive ability scores. They found evidence that several cognitive ability assessments are not invariant across cohorts, implying that the Flynn Effect, a mainstay of introductory psychology textbooks, may partly be a reflection of noninvariance of intelligence scores for different cohorts. In other words, some of the Flynn Effect might amount to an artifact of measurement rather than true changes in the cognitive ability of successive generations. Now that the importance of measurement invariance has been explained, we move to an explanation of how to test invariance using common analytic tools.

A method for testing invariance

MI can be evaluated via the estimation and comparison of progressively more restricted confirmatory factor analytic models. First, a configural model is estimated as a baseline. In this model, the same indicators load on the same factors across time (or groups), but the factor loadings, intercepts, and residual variances are allowed to vary across time (or groups). A typical configural model for two time points is depicted in Fig. 1. In this baseline model, the scale of the factors needs to be fixed to identify the model. In the SEM tradition, this is usually accomplished by selecting one indicator to serve as a reference indicator. A reference indicator has its factor loading and intercept fixed to 1 and 0, respectively. These constraints scale the latent factor on the same metric as the reference indicator at both time points (or across groups). Notably, the reference indicator has invariance imposed from the outset and cannot itself be tested for invariance. The selection of the reference indicator must therefore be intentional, and methods exist to guide this process (e.g., Jung & Yoon, 2017). MI can of course be tested multiple times

with different reference indicators, but this can become time intensive and cumbersome. Our advice is to pick reference indicators carefully. An inspection of the indicator-total correlations (e.g., item-total correlations in an internal consistency analysis) and indicator means (and variances) at both time points can help. We suggest selecting an indicator with a strong correlation with the remaining indicators at each occasion and a level of consistency over time that compares favorably to the other indicators. Also, one should pick an indicator with similar means and variances at each occasion.

As an aside, models from the IRT tradition typically avoid this problem of picking reference indicators by fixing the mean and variance of the latent factors themselves to 0 and 1 (Tay, Meade, & Cao, 2015). This specification puts the factors on a standardized metric, so mean and variance differences in the configural model cannot be examined; however, it allows all items to be tested for MI. As subsequent constraints are added to the model to test invariance, factor means and variances can be freed for all factors except for the reference factor. All other factor means are interpreted in relation to the reference factor (e.g., a Time 2 factor mean of -.50 denotes that the average value at Time 2 is half a standard deviation smaller than the average value at Time 1). Configural models based on either the reference indicator or standardized factor approach are equivalent and will fit the data to the same degree.

After the configural model has been estimated and a baseline model fit is established, MI testing proceeds along a fairly standard course. First, constraints on the corresponding factor loadings are imposed across the two occasions. This tests for the presence of weak invariance (WI), or regression weight invariance. For instance, using Fig. 1 as a guide, the factor loading for indicator 2 at Time 1 is constrained to equal the factor loading for indicator 2 at Time 2.


Further, if the standardized factor specification was used, the variances of all factors except one should be freed; otherwise, the model becomes one in which equality in the factor loadings and factor variances is being tested. (Typically, the variance of the latent factor at Time 1 is fixed and the variances of subsequent waves are freely estimated.) The fit of the WI model is then compared to the configural model. If the model fit is not significantly worse, researchers conclude weak invariance has been satisfied: indicators have the same factor loadings at Time 1 and Time 2.

If the WI constraints are supported, constraints on the corresponding indicator intercepts are next imposed. This tests for the presence of strong or scalar invariance (SI). If the standardized factor specification was used, the means of all factors except one should be freed; otherwise, the model becomes one in which equality in the intercepts and factor means is being tested. The fit of this SI model is then compared to the fit of the WI model. Generally speaking, if WI is not supported, there is no need to estimate the SI model. As illustrated in Fig. 2, equivalent intercepts in the absence of equivalent slopes are not particularly useful. Finally, if SI is supported, constraints on the corresponding residual variances are imposed over time and model fit is compared. This tests for residual invariance (RI), or strict invariance. Altogether, this is a fairly straightforward multistep procedure. However, at least two major details warrant further discussion: how nested models are compared, and how a lack of support for MI at a given level is treated.

In the context of invariance testing, comparisons in fit have historically been based on the statistical significance of the chi-square difference test (Δχ²; West, Taylor, & Wu, 2012). When models are nested, the difference between models tends to follow a chi-square distribution, with as many degrees of freedom (df) as the difference in df between models. A significant value is interpreted to mean that the constraints of the more restricted model have produced a statistically significant drop in fit. This drop in

845

fit is undesirable, and thus the constraints are rejected: the more restrictive level of measurement invariance is seen as implausible in light of the current data. The chi-square test of model fit has some limitations, however. The most basic is that larger sample sizes can magnify misfit such that even inconsequential misspecifications will cause models and constraints to be rejected. This has led to the development of many alternative fit indices (West et al., 2012). Although the chi-square difference test may be somewhat less problematic than the chi-square test of absolute fit, there have still been efforts to develop alternative tests of comparative fit (Meade, Johnson, & Braddy, 2008). The two most popular alternatives are based on examining changes in the root mean square error of approximation (RMSEA) and the comparative fit index (CFI) across different models. Lower RMSEA values denote better fit, whereas higher CFI values indicate better fit. Accordingly, comparatively higher RMSEA values and lower CFI values between less and more restricted models could potentially be used to indicate a meaningful drop in model fit.

Although the idea of comparing RMSEA and CFI values might seem attractive in light of concerns with the chi-square tests, the complication is figuring out thresholds that indicate when a specification triggers an unacceptable level of misfit. Multiple thresholds in this vein have been proposed, and such thresholds are typically based on fairly simple models (Chen, 2007; Cheung & Rensvold, 2002; Meade et al., 2008). These realities, paired with the general problems associated with fit thresholds (Clark & Bowles, 2018; Markland, 2007; Marsh, Hau, & Wen, 2004; Saris, Satorra, & van der Veld, 2009; Steiger, 2007), suggest that although changes in the RMSEA and CFI can be useful, they should be used thoughtfully and with an understanding that any threshold used for decision-making purposes is arbitrary at some level. Critical researcher judgment is needed, and perhaps even a priori exposition of decision-making heuristics so they are not dependent on the observed results and fashioned post hoc. Combining the chi-square difference test with the changes in RMSEA and CFI likely offers the most holistic representation of comparative fit; if all three indicators suggest a nontrivial drop in fit, constraints may be rejected. A minimal sketch of this sequence follows.
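The sketch below uses lavaan in R (one common choice; the chapter does not prescribe a package). For simplicity it treats the two occasions as if they were independent groups, which ignores the dependency between repeated measures; a true longitudinal test instead uses a single model with equality-labeled parameters, as in the earlier simulation sketch. The data set dat2 and the grouping variable wave are hypothetical.

library(lavaan)

# Hypothetical data: five anxiety indicators y1-y5 and a grouping
# variable "wave" distinguishing the two assessments.
mod <- ' anxiety =~ y1 + y2 + y3 + y4 + y5 '

configural <- cfa(mod, data = dat2, group = "wave")
weak   <- cfa(mod, data = dat2, group = "wave",
              group.equal = "loadings")
strong <- cfa(mod, data = dat2, group = "wave",
              group.equal = c("loadings", "intercepts"))
strict <- cfa(mod, data = dat2, group = "wave",
              group.equal = c("loadings", "intercepts", "residuals"))

# Chi-square difference tests between adjacent nested models
anova(configural, weak, strong, strict)

# Changes in RMSEA and CFI across the sequence
sapply(list(configural = configural, weak = weak,
            strong = strong, strict = strict),
       fitMeasures, fit.measures = c("rmsea", "cfi"))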


What, then, is done when the comparative fit tests reject some set of constraints? A researcher could conclude that the measure is noninvariant and abandon the overarching research endeavor, resigned to the fact that all conclusions will be marred by measurement noninvariance. Few researchers will elect to take this route. When one set of constraints is rejected, tests of partial invariance (PI) often follow. The idea of PI is that although some indicators may be noninvariant, others are not (Millsap & Olivera-Aguilar, 2012). Recall that the previous tests of weak, strong, and strict invariance are omnibus tests. A failure to satisfy any step in this process signals that something is amiss with at least one of multiple indicators at each step. The tests do not tell the researcher which indicators are problematic, just as an omnibus F statistic in a five-group ANOVA comparison does not tell the researcher which group means are different. Tests of PI thus entail trying to identify the indicators that are primarily contributing to reductions in overall model fit.

Several approaches are available to identify those indicators. Examining the modification indices of the more constrained model will often provide some sense of which constraints are driving misfit. An examination of the parameter estimates of interest in the baseline model can also prove extremely useful. For example, if the heart racing indicator had the largest intercept value relative to the other items at Time 1 and the smallest intercept value relative to the other indicators at Time 2, there is a good chance it may be noninvariant. Indicator parameters can also be freed or constrained one at a time, with those models compared to the previous model, in an attempt to identify the indicators most contributing to misfit; a sketch of this follows.
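Continuing the hypothetical lavaan sketch from above, one might locate and free a noninvariant intercept as follows. The flagged parameter (the y3 intercept) is invented for illustration:

# Score tests for each equality constraint in the strong-invariance model;
# large statistics point to the constraints driving misfit.
lavTestScore(strong)

# Refit with the offending intercept freed across groups (partial invariance)
partial_strong <- cfa(mod, data = dat2, group = "wave",
                      group.equal   = c("loadings", "intercepts"),
                      group.partial = "y3 ~ 1")
anova(weak, partial_strong)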

If there are many indicators, they can be constrained in "chunks" in an effort to home in on the most problematic indicators. The IRT program flexMIRT (Cai, 2012) provides an automated feature ("DIF Sweep") in which every item is individually tested for DIF using a recently developed method (Woods, Cai, & Wang, 2013). Although very efficient, this approach may be overly conservative (identifying more DIF than there really is), and so should be followed up with more targeted tests (e.g., based on anchor items identified in the initial DIF sweep). When the indicators driving noninvariance are identified, their parameters should be unconstrained, and future tests of noninvariance should build upon the most justifiably constrained PI model. For example, if an indicator demonstrated differential factor loadings across time, its intercepts would not be constrained in the subsequent tests of SI. Notably, examinations of PI at the level of RI are somewhat rare. WI and SI are primarily needed for most analyses researchers conduct, with RI typically being very difficult to achieve and perhaps not important for many research applications (Millsap & Olivera-Aguilar, 2012). As illustrated in the simulations, WI and SI in the absence of RI did not typically bias conclusions in any meaningful way (Table 1).

The testing framework described in this section is a useful starting point for the evaluation of MI. However, the general approach we outlined has some limitations, especially in the context of more intensive longitudinal designs. The nested model CFA approach quickly becomes unwieldy as more occasions of measurement (or groups) are added. There is also evidence that stepwise model comparisons to identify optimal PI models carry a nontrivial risk of misidentifying sources of noninvariance, which under certain conditions can lead to conclusions as biased as if simple observed scores were used (Marsh et al., 2018). Finally, nested model


comparisons based exclusively on comparative model fit are plagued by many of the traditional shortcomings of null hypothesis significance testing. Most relevant here is the issue of statistical versus practical significance: even if certain indicators demonstrate problems with invariance, overall results may not necessarily be distorted to a meaningful degree. In the next two sections, contemporary techniques for detecting invariance that address some of these shortcomings are described, and then different methods of quantifying the practical impact of noninvariance are presented.

Advanced techniques for testing invariance

The first recent technique of note is known as the "alignment" method (Asparouhov & Muthén, 2014; Flake & McCoach, 2018; Marsh et al., 2018; Muthén & Asparouhov, 2014). This approach was developed, in part, to address the complications that arise when assessing MI across many groups, and also to help avoid the haphazard nature of stepwise testing procedures when investigating partial invariance. In the alignment method, a procedure similar to rotation in EFA is applied to the configural model. In EFA, initial factor solutions that can be hard to interpret are rotated to produce a "simple structure" in which factor loadings are either particularly high or particularly low for a given factor (MacCallum, 2009; see Sass & Schmitt, 2010). This increases interpretability, and the same basic logic underlies the alignment method. That is, in an attempt to minimize a simplicity/loss function, the configural model is "rotated" to an optimal point such that the resulting model contains a few indicators that are exceptionally noninvariant across groups (or time points), while the remaining indicators demonstrate only negligible MNI. This is based in part on the extent to which a given indicator's parameter estimate


in one group differs from the average estimate of that parameter across groups. The end result is a model with equivalent fit to the original configural model in which the largest sources of MNI have been isolated. The alignment method can effectively highlight which indicators are most noninvariant, and across which groups or waves of assessment. It is even possible to embed the optimal "aligned" solution into broader structural models. Simulation evidence supports the utility of the alignment method and suggests that it will often perform better than stepwise nested model comparisons. Furthermore, it is fairly straightforward to implement the alignment method in the popular Mplus software package (see Muthén & Asparouhov, 2014 for sample syntax). There are some current limitations to this method, however. The most relevant shortcoming is that the alignment method was originally designed for group comparisons, such as comparing scores from multiple racial/ethnic groups. It can be used to assess longitudinal invariance, but the data have to be reformatted in a way that will often be inconsistent with later modeling strategies. Beyond the mild inconvenience of reformatting the data, this precludes the direct incorporation of optimally rotated alignment models into larger models, such as a second-order latent growth curve model (Marsh & Muthén, personal communication, 2017).

Another promising technique is moderated nonlinear factor analysis (MNLFA; Bauer, 2017; Bauer & Hussong, 2009; Curran et al., 2014). This method has the potential to examine MI across multiple dimensions at once and accommodate both categorical and continuous predictors of MI. Investigations of MI are typically predicated on the existence of a discrete grouping of respondents and time points so that a handful of distinct measurement models can be specified and compared. The MNLFA takes a different approach in that only one measurement model is specified, but the major


parameters of the model are modeled as a function of the potential sources of noninvariance. That is, factor loadings, intercepts, and residual variances (as well as factor variances and means) are not specified as individual parameters to be estimated per se, but as outcome values in a regression equation that includes all hypothesized predictors of noninvariance. For example, factor loadings may be specified as a function of both time of day (using a 24-h clock) and participant sex (coded as 0 and 1 for male and female). The expected value of a factor loading in this model is then given by the regression equation:

λ = b0 + b1(time of day) + b2(sex),    (4)

where b0 is an intercept that captures the factor loading for men at midnight (that is, hour 00); b1 is a regression coefficient that captures the extent to which a change in the hour of the day is related to a change in the factor loading; and b2 is a regression coefficient that captures the extent to which the factor loading differs across women and men. The presence or absence of MI along some dimension is evaluated by examining the strength of the regression coefficients. The broad applicability of this method and the ease of its interpretation make it a potentially useful tool in many research contexts, especially intensive longitudinal designs (at least in the abstract). However, the original authors note that enthusiasm should be somewhat tempered, as this is still a fairly novel approach. Accordingly, certain issues are still being worked out (e.g., the handling of multiple covariance parameters, the optimal number and nature of anchor items; Bauer, 2017), and further performance evaluations are needed. Still, this is a promising avenue of development, especially in the context of intensive longitudinal designs. The MNLFA method should be implementable in most SEM programs (see Bauer, 2017 for sample syntax).
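As a purely numeric illustration of Eq. (4), with coefficient values invented for illustration:

# Invented values: the loading drifts slightly across the day and is
# somewhat higher for women (sex = 1) than for men (sex = 0).
b0 <- 0.70    # loading for men (sex = 0) at hour 00
b1 <- -0.005  # change in the loading per hour of the day
b2 <- 0.10    # difference in the loading for women vs. men

b0 + b1 * 14 + b2 * 1   # loading for a woman at 14:00 -> 0.73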

The practical impact of noninvariance

Beyond the simple detection of noninvariance, methodologists have started to develop effect sizes for quantifying the degree to which measurement variability across groups or time is a concern. These tools and techniques help researchers quantify the practical impact of any noninvariance that was detected using other methods. One of these tools is Cohen's d for Mean and Covariance Structure Analysis (dMACS; Nye & Drasgow, 2011). The dMACS is an indicator-level Cohen's d effect size for the area between two measurement-model-implied regression lines. Returning to the bottom left quadrant of Fig. 2, a graph is depicted of the expected responses for two time points across the construct of interest when WI holds but SI does not. The dMACS quantifies the area between those two regression lines and is expressed in the standard metric of a Cohen's d, which facilitates interpretation. That is, these effect sizes can be interpreted using the typical heuristics for what constitutes a small (d < .20), medium (.21 < d < .49), and large (d > .50) effect size. For example, the dMACS for the bottom left quadrant of Fig. 2 is approximately d = 1.4, a large effect, which highlights how dramatic an illustration was presented for didactic purposes.

A couple of caveats are relevant when interpreting the dMACS. The first relates to the standard notion that effect size thresholds for small, medium, and large are somewhat arbitrary. The second is that dMACS effect sizes capture the impact of MNI at the indicator level. A lack of invariance at the indicator level will not always exert a major influence on higher levels of analysis (e.g., the factor level), especially if different kinds of noninvariance at lower levels cancel out when aggregated (i.e., a large dMACS does not always entail meaningfully biased conclusions for composites, depending on the context and other indicators). In other words, indicators can be noninvariant in opposite ways such that the overall impact of such errors at the level of a composite is rather minimal when summed over many indicators.
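The following rough sketch numerically approximates the quantity behind dMACS for the bottom-left-panel scenario (equal loadings, unequal intercepts). The parameter values and the pooled SD are invented for illustration; this is not the Nye and Drasgow (2011) program itself.

# Expected item response at each time point as a function of latent Anxiety
lambda1 <- 0.75; tau1 <-  0.0   # Time 1 loading and intercept
lambda2 <- 0.75; tau2 <- -1.5   # Time 2: same slope, lower intercept (no SI)
sd_pooled <- 1.0                # pooled observed-score SD (invented)

eta <- seq(-4, 4, length.out = 2001)   # grid over the latent dimension
w   <- dnorm(eta); w <- w / sum(w)     # density weights for the latent scores

# Root of the weighted mean squared gap between the two regression lines,
# scaled by the pooled SD: the core of the dMACS effect size
gap2  <- ((tau1 + lambda1 * eta) - (tau2 + lambda2 * eta))^2
dmacs <- sqrt(sum(w * gap2)) / sd_pooled
dmacs   # = 1.5 with these invented values: a "large" effect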


Thus, researchers need to evaluate dMACS results across all indicators and with an eye toward aggregated impact. Descriptive statistics associated with dMACS are calculated with a specialized program (see Nye & Drasgow, 2011). One benefit of this program is that it also provides effect size estimates at the latent factor level. Specifically, the program provides the expected differences between test means and variances due to noninvariance, which can be used in conjunction with the observed means and variances to compute the approximate proportion of the observed differences that are primarily a reflection of noninvariance. For example, Clark et al. (2017) used this method in an examination of measure equivalence across children's gender in a child temperament survey and found that around 36% of the observed gender difference in Negative Affectivity was likely due to noninvariance, and not actual gender differences. One major advantage of the dMACS program is that it relies on input from the configural model. Accordingly, these effect size estimates are not reliant on stepwise PI tests that may have gone awry, and they present a holistic representation of MNI's impact across loadings, intercepts, and residual variances. Two disadvantages, however, are that the program can currently be used only with dichotomous or continuous indicators, and that only two groups or time points can be compared at once.

Another approach is referred to as the latent trait estimation effect size (Tay et al., 2015), and it effectively provides a Cohen's d for the overall effect of noninvariance on mean comparisons at the construct/test level. This approach requires first identifying the optimally constrained PI model, which provides the factor mean difference(s) after adjusting for noninvariance (i.e., the "true" difference). The idea is to compare this mean difference to the mean difference in a model that does not take noninvariance


into account, that is, a model with equality constraints on all factor loadings, intercepts, and residual variances across time. As noted, this effectively imposes the assumptions of sum or average scores computed with observed variables. The difference between the mean differences of the PI and fully constrained models is then computed. Provided the latent factors were identified by setting the reference occasion (e.g., Time 1) mean and variance to 0 and 1, respectively (instead of using reference indicators), the difference between the PI and fully constrained model can be interpreted as a Cohen's d for the impact of MNI at the level of mean comparisons across time (or groups); a sketch appears below. Millsap and Olivera-Aguilar (2012) present a similar method that applies at the indicator level. Differences in intercepts in the PI model are compared to observed mean differences in the indicators, providing a sense of the extent to which mean differences at the indicator level reflect noninvariance versus actual differences in the latent construct. The general logic of these analyses can be extended to other situations as well. That is, comparing parameter estimates in measurement or structural models with and without full invariance imposed can be a useful way to get a sense of how noninvariance may affect major conclusions. The advantages of these approaches include quick, easily interpretable effect sizes that concisely present a broad summary of how noninvariance may impact conclusions. Further, it is easier to incorporate more than two groups or time points at once than in the dMACS program. On the other hand, this method relies on the identification of a mostly accurate PI model, and it will still become onerous when there are a large number of time points.
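Continuing the hypothetical two-group lavaan sketch from earlier, the comparison might look as follows. Here std.lv = TRUE identifies the factors by fixing the first group's latent mean and variance, with lavaan freeing the other group's latent moments as equality constraints are added; the freed y3 intercept is carried over from the partial-invariance sketch.

# Optimally constrained partial-invariance model (the "true" difference)
pi_fit <- cfa(mod, data = dat2, group = "wave", std.lv = TRUE,
              group.equal   = c("loadings", "intercepts", "residuals"),
              group.partial = "y3 ~ 1")

# Fully constrained model (the effective assumption of composite scores)
full_fit <- cfa(mod, data = dat2, group = "wave", std.lv = TRUE,
                group.equal = c("loadings", "intercepts", "residuals"))

# Estimated latent mean of group 2 in each model (group 1 mean fixed at 0);
# the difference between the two estimates approximates a Cohen's d for the
# impact of noninvariance on the mean comparison.
pi_mean   <- subset(parameterEstimates(pi_fit),
                    lhs == "anxiety" & op == "~1" & group == 2)$est
full_mean <- subset(parameterEstimates(full_fit),
                    lhs == "anxiety" & op == "~1" & group == 2)$est
pi_mean - full_mean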


The effect size metrics described thus far are based on the standard MI testing paradigm. However, the more recent approaches for identifying MI discussed in this section also tend to incorporate methods of quantifying the practical effect of noninvariance. In the alignment procedure, users are given information regarding the extent to which each indicator's parameter estimates contributed to the simplicity/loss function. This shows how an indicator's parameter estimates in one group differ from the average estimate of that parameter across all groups, which can be used as an effect size insofar as understanding the size and direction of this difference can provide insight into how the assumption of MI will potentially bias conclusions. Furthermore, the entire point of MNLFA is to provide regression coefficients that indicate both the size and direction of any MNI effects. These latter two methods can also be used in conjunction with those described earlier in this section. For example, the parameter estimates from a particularly discrepant group in the alignment method can be used as one group in the dMACS program, while the aggregate estimates can be used as the parameter estimates for the other group. This could provide a dMACS estimate for the magnitude by which that indicator differs in one group relative to the rest. Also, structural models could potentially be estimated both with and without the MI regression coefficients of the MNLFA fixed to 0.

When invariance fails

All but the most underpowered investigations of MI will likely detect at least some noninvariance for some indicators. The presence of MNI in the intercepts and residual variances is almost so common as to be expected (Marsh et al., 2018; Millsap & Olivera-Aguilar, 2012). The issue thus becomes: What are researchers to do with the information that MI does not hold? If noninvariance is widespread and severe, observable across many types of parameters and most indicators, the only thing to do may be to adopt a self-critical perspective and figure out what psychological processes are operating to produce such dramatic incompatibility. All

substantive conclusions should then be heavily tempered. If noninvariance seems minimal and/or appears to have little practical impact based on effect size metrics, it could simply be ignored. For example, in Clark et al. (2017), it was rare for strict invariance to hold across children's gender; however, for most scales that were invariant through the intercepts, the effect sizes for this MNI were trivial (e.g., dMACS < .20).

Q ∈ ℝ^(v × v) (DIFFUSION) represents the variance–covariance matrix of the diffusion process in continuous time.

Subject-level measurement model

The latent process vector η(t) has the measurement model:

y(t) = Λη(t) + τ + e(t), where e(t) ~ N(0_c, Θ)    (3)

y(t) ∈ ℝ^c is the vector of manifest variables, Λ ∈ ℝ^(c × v) (LAMBDA) represents the factor loadings, and τ ∈ ℝ^c (MANIFESTMEANS) the manifest intercepts. The residual vector e ∈ ℝ^c has covariance matrix Θ ∈ ℝ^(c × c) (MANIFESTVAR).

Overview of hierarchical model

Parameters for each subject are first drawn from a simultaneously estimated higher-level distribution over an unconstrained space; a set of parameter-specific transformations is then applied so that (a) each parameter conforms to necessary bounds and (b) is subject to the desired prior. Following this, in some cases matrix transformations are applied to generate the


continuous time matrices described. The higher-level distribution has a multivariate normal prior (though priors may be switched off as desired). We provide a brief description here, and an R code example later in this work, but for the full details see Driver and Voelkle (2018a). The joint posterior distribution of the model parameters given the data is as follows:

p(Φ, μ, R, β | Y, z) ∝ p(Y | Φ) p(Φ | μ, R, β, z) p(μ, R, β)    (4)

Subject-specific parameters Φ_i are determined in the following manner:

Φ_i = tform(μ + R h_i + β z_i)    (5)

h_i ~ N(0, 1)    (6)

μ ~ N(0, 1)    (7)

β ~ N(0, 1)    (8)

Φ_i ∈ ℝ^s is the s-length vector of parameters for the dynamic and measurement models of subject i. μ ∈ ℝ^s parameterizes the means of the raw population distributions of subject-level parameters. R ∈ ℝ^(s × s) is the matrix square root of the raw population distribution covariance matrix, parameterizing the effect of subject-specific deviations h_i ∈ ℝ^s on Φ_i. β ∈ ℝ^(s × w) is the raw effect of time-independent predictors z_i ∈ ℝ^w on Φ_i, where w is the number of time-independent predictors. Y_i contains all the data for subject i used in the subject-level model: y (process-related measurements) and x (time-dependent predictors). z_i contains the time-independent predictor data for subject i. tform is an operator that applies a transform to each value of the vector it is applied to; the specific transform depends on which subject-level parameter matrix the value belongs to, and on the position in that matrix.

At a number of points, we will refer to the parameters prior to the tform function as "raw" parameters. So, for instance, "raw population standard deviation" would refer to a diagonal entry of R, and "raw individual parameters for subject i" would refer to μ + R h_i + β z_i. In contrast, without the "raw" prefix, "population means" would refer to tform(μ), and would typically reflect values the user is more likely to be interested in, such as the continuous time drift parameters.
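As a minimal numeric sketch of this raw-to-transformed idea: the actual transforms in ctsem are parameter-specific, and exp() is shown here only as the generic pattern for a positivity-bounded parameter, with all values invented.

# Invented values for a single positivity-bounded parameter
mu_raw   <- -0.50   # raw population mean
R_entry  <-  0.30   # raw population SD (a diagonal entry of R)
h_i      <-  0.80   # subject i's standardized deviation
beta_raw <-  0.20   # raw effect of a time-independent predictor
z_i      <-  1.20   # subject i's predictor value

raw_i <- mu_raw + R_entry * h_i + beta_raw * z_i   # raw individual parameter
exp(raw_i)   # transformed parameter, guaranteed positive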

Install software and prepare data

Install the ctsem software within R:

install.packages("ctsem")
library("ctsem")

Prepare data in long format, each row containing one time point of data for one subject. We need a subject id column, named by default "id," though this can be changed in the model specification. Some of the outputs are simpler to interpret if subject id is a sequence of integers from 1 to the number of subjects, but this is not a requirement. We also need a time column "time," containing positive numeric values for time; columns for manifest variables (the names of which must be given in the next step using ctModel); columns for time-dependent predictors (these vary over time but have no model estimated and are assumed to impact latent processes instantly); and columns for time-independent predictors (which predict the subject-level parameters that are themselves time invariant, so the values for a particular time-independent predictor must be the same across all observations of a particular subject).

      id  time   Y1    Y2     TD1  TI1     TI2    TI3
 [1,]  1  0.594  7.26  12.88  0    0.0538  1.366  0.2075
 [2,]  1  0.889  7.50  12.65  1    0.0538  1.366  0.2075
 [3,]  1  1.187  7.42  11.74  0    0.0538  1.366  0.2075
 [4,]  2  2.406  3.55   9.03  0    NA      0.997  0.1521
 [5,]  2  2.631  3.73  10.53  0    NA      0.997  0.1521
 [6,]  2  3.014  2.58   8.31  0    NA      0.997  0.1521
 [7,]  2  3.221  4.11   7.77  0    NA      0.997  0.1521
 [8,]  3  0.000  6.32   9.18  0    0.5283  0.546  0.0284
 [9,]  3  0.268  3.83   6.45  0    0.5283  0.546  0.0284


As will be discussed in detail later, default priors for the model are set up with the intention of being "weakly informative" for typical applications in the social sciences, on data that are centered and scaled. Because of this, we recommend grand-mean centering and scaling each variable in the data, with the exception of time-dependent predictors, which should be scaled but centered such that a value of zero implies no effect. This exception arises because time-dependent predictors are modeled as impulses to the system with no persistence through time; at all times when not observed, their value is inherently zero. Similarly, we expect a time interval of 1.00 to reflect some "moderate change" in the underlying process. If we wished to model daily hormonal fluctuations, with a number of measurements each day, a time scale of hours, days, or weeks could be sensible; minutes or years would likely be problematic. If the data are not adjusted according to these considerations, the priors themselves should be adjusted, or at least their impact carefully considered. Note also that an inappropriate scaling of time can result in numerical difficulties, regardless of priors.

ctstantestdat[,c ('Y1','Y2','TI1','TI2','TI3')]