Lisa D. Wijsen
Twenty Interviews With Psychometric Society Presidents What’s on the Mind of the Psychometrician?
Lisa D. Wijsen Social and Behavioral Sciences University of Amsterdam Amsterdam, Noord-Holland, The Netherlands
ISBN 978-3-031-34857-0 ISBN 978-3-031-34858-7 (eBook) https://doi.org/10.1007/978-3-031-34858-7 © Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
In this book, the people who run the field of psychometrics take center stage. Twenty Interviews With Psychometric Society Presidents: What’s on the Mind of the Psychometrician? is a compilation of the 20 interviews I held with Psychometric Society presidents between 2016 and 2018 as part of my dissertation. In my dissertation, I took psychometrics as the object of my research and investigated this intriguing discipline using a number of different approaches. One path that I traveled, the path that resulted in this book, was a qualitative research project in which I asked 20 prominent psychometricians about their perceptions of the historical, current, and future practices of their field. The transcripts of the interviews were the basis of a qualitative analysis,¹ a revised version of which can be found as the concluding chapter of this book. That chapter illustrates and elaborates on the diversity in the presidents’ approaches toward psychometric research and highlights the dilemmas that lie at the basis of some fundamental disagreements among the presidents. However, the presidents’ testimonies are much richer in content than the final chapter of this book can possibly portray. The interviews deserved their own platform, which has become this compilation.

¹ Wijsen, L. D. & Borsboom, D. (2021). Perspectives on psychometrics: Interviews with 20 Past Psychometric Society Presidents. Psychometrika, 86, 327–343.

To collect a sample of diverse but representative testimonies, we chose the presidents of the Psychometric Society as our interviewees. The Psychometric Society was founded in 1937 by one of the most prominent psychometricians in history, Louis L. Thurstone, and has since been the most important professional association for psychometric research. Its presidents are democratically chosen by members of the Psychometric Society, and the presidents’ work is considered leading in the field. The Society’s presidents are (or have been) representatives of many of the main research traditions, such as Item Response Theory (with Wim van der Linden, Chap. 11, and Klaas Sijtsma, Chap. 20, as representatives), Factor Analysis/Structural Equation Modeling (with Peter Bentler, Bengt Muthén, and Jos ten Berge as representatives, resp. Chaps. 3, 6, and 19), and Multidimensional Scaling (Jan de Leeuw, Jacqueline Meulman, and Willem Heiser, resp. Chaps. 5, 14, and 15). Though most presidents are part of one of these classic psychometric research traditions (or have been in the past), the presidents come from a variety of backgrounds and do research on a variety of topics, such as, to name just a few, Computerized Adaptive Testing (Hua-Hua Chang, Chap. 21), diagnostic classification (Bill Stout, Chap. 13), and evidence-centered assessment design (Robert Mislevy, Chap. 8). Sometimes the presidents are not trained as psychometricians per se but come from other areas in psychology (like Ulf Böckenholt, Chap. 16), or made a transfer from an entirely different field like mathematics (like James Ramsay and Ivo Molenaar, resp. Chaps. 2 and 9). Some consider themselves statisticians or consultants rather than psychometricians (like Paul Holland and Brian Junker, resp. Chaps. 7 and 18); others emphasize the role of psychology in their work (like Susan Embretson and Paul De Boeck, resp. Chaps. 10 and 17). Some are critical of the role and future of psychometrics (like Larry Hubert, Chap. 4); others are more optimistic about the future and purpose of the field (like David Thissen, Chap. 12). Overall, their expertise covers both the more traditional psychometric research areas and research areas that are more on the outskirts of psychometrics. Due to the presidents’ expertise and central position in the field, their testimonies are intrinsically interesting and relevant to any investigation into the practice of psychometric research.

The chapters in this book are thoroughly revised versions of the original transcripts. The transcripts were often in need of further clarification, explanation, and/or revision to make them ready for publication. Both the author of this book and the interviewees have participated in the revision and editing of the chapters, to improve both readability and accuracy. Note that the order of the chapters is chronological: the first interview is with the earliest president of the Psychometric Society among the interviewees (James Ramsay), and the last interview is with the most recent president among the interviewees (Hua-Hua Chang).

Amsterdam, Noord-Holland, The Netherlands
Lisa D. Wijsen
Acknowledgments
This book has come to fruition with the help and support from all the interviewees; not only did they allow me to interview them, but they also assisted me with extremely helpful suggestions and edits, for which I am highly grateful. I thus wish to express many, many thanks to Bengt Muthén, Bill Stout, Brian Junker, David Thissen, Hua-Hua Chang, Ivo Molenaar, Jacqueline Meulman, James Ramsay, Jan de Leeuw, Jos ten Berge, Klaas Sijtsma, Lawrence Hubert, Paul De Boeck, Paul Holland, Peter Bentler, Robert Mislevy, Susan Embretson, Ulf Böckenholt, Willem Heiser, and Wim van der Linden. I hope this collection of interviews does justice to their ideas about the fascinating field of psychometrics.
Contents
1 Introduction
  Four Themes
    The Interviewee’s Career
    The History of Psychometrics
    The Position of Psychometrics Among the Other Disciplines
    The Future of Psychometrics
  The Merit of This Book
2 James Ramsay
3 Peter Bentler
4 Larry Hubert
5 Jan de Leeuw
6 Bengt Muthén
7 Paul Holland
8 Robert Mislevy
9 Ivo Molenaar
10 Susan Embretson
11 Wim van der Linden
12 David Thissen
13 Bill Stout
14 Jacqueline Meulman
15 Willem Heiser
16 Ulf Böckenholt
17 Paul De Boeck
18 Brian Junker
19 Jos ten Berge
20 Klaas Sijtsma
21 Hua-Hua Chang
22 Themes and Visions
  The Substantive vs. the Data Analytic
  Theory vs. Application
  Narrow vs. Broad Focus
  PR Problems
  Conclusion
About the Author
Lisa D. Wijsen studied psychology between 2010 and 2015. In 2015, she received a research master’s degree in psychological methods, with a minor in philosophy of science. In 2021, she defended her dissertation Characterizations of Psychometrics, which is a collection of studies on the history and philosophy of psychometrics. Her research interests are psychological measurement, (qualitative) research methodology, and the history and philosophy of psychology. She is currently a lecturer at the Psychological Methods Department at the University of Amsterdam, teaching courses on the history and philosophy of psychology.
Chapter 1
Introduction
The practice of scientific research is often associated with formulating research ideas, publishing research articles in scientific journals, and ultimately, of course, the development of knowledge. But research is not only made up of models, discoveries, or manuscripts. Often overlooked but completely essential are the researchers themselves, the ones who have the ideas, write the manuscripts, and keep the field up and running. And these researchers make decisions and choices which may determine the course of a discipline. In this book, psychometricians feature as the main characters, as they share their individual perspectives on the history, current state, and future of psychometrics in 20 interviews.

Psychometrics is a subarea of psychology that is best known for its many measurement instruments, such as intelligence or personality tests, and methods for the quantitative analysis of measurement data. Psychological and educational measurements have become a powerful presence in many aspects of our lives: test scores might help decide whether we are accepted into the university of our choice, whether we should be diagnosed with a mental disorder, or whether we get hired for our dream job. Whether we have encountered them at school, in our jobs, or in the (mental) health sector, many of us are familiar with psychometric applications. However, the people featured in this book are largely psychometricians who work on the quantitative end of the psychometric spectrum. Our interviewees usually do not construct items or administer tests but are the people who work “behind the scenes” on matters like drawing up psychometric models, improving measurement procedures, and developing methods for data analysis. Their research is often highly technical, abstract, deeply embedded in statistics, and hard to grasp for people without their expertise.
Traditionally, psychometricians concern themselves with modeling the relationships between one or more latent variables, such as cognitive abilities, and a set of observed variables, like item responses. Most (though not all) psychometric models fall under this category and are called latent variable models. The territory of most contemporary psychometric research is the technical side of these psychometric or statistical models, and psychometricians often investigate topics like model extension, measures for goodness of fit, and methods for parameter estimation. Their research often goes hand in hand with writing software packages which allow the application of their models by other researchers. Many contemporary psychometricians thus work in the background of the larger, well-known psychometric applications. Some have direct connections to testing agencies like SAT or ACT (in the United States) or Cito (in the Netherlands); others develop methods for the analysis of psychological or educational data and provide methodological advice or solutions to applied researchers at their university. Most psychometricians are not public figures, and their ideas and views are rarely the topic of conversation. This book intends to change that.

Psychometrics is a field that has had a tremendous influence both on psychological and educational research and on society. It has influenced how psychologists have come to think of psychological attributes as measurable entities, how psychology has developed as a largely quantitative science, and how measurement and testing have taken up such an important position both in psychology and in society. The impact that psychometrics has had on our society and research on the one hand, and the lack of knowledge about what it is these people actually do on the other, call for a closer investigation into the field. It is thus high time we heard from these scientists themselves what it is that has inspired them to become a psychometrician in the first place and how they reflect on the practice of psychometric research.

Twenty Interviews With Psychometric Society Presidents: What’s on the Mind of the Psychometrician? tells the stories of the people who, even though they are somewhat invisible to outsiders, are the driving forces of psychometric research, teaching, and practice. In semistructured interviews, 20 presidents of the Psychometric Society share personal memories and ideas about what they are most passionate about: psychometrics. The interviews tell the stories of how they moved into the psychometric field, what inspired them to pursue this path, and what still drives them to do the research they deem so important.

© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_1
Four Themes

The interviews can be read independently from each other, and the reader can pick those chapters he or she finds most interesting or relevant. However, read together, the compilation sheds light on the differences in visions held by the entire group of interviewees. Together, the chapters sketch an image of psychometrics that is characterized by proper and rigorous research on the one hand, but fragmented on the other: psychometrics turns out to be intrinsically multidisciplinary, both open to and closed off from new contemporary developments in other fields, and in its current shape, perhaps not very future-proof. There are four themes which run as main threads through each of the chapters: the course of the interviewee’s career, their views on the relationship between psychometrics and other fields like psychology and statistics, and their views on both the history and the future of the field, which I will discuss briefly below. Chapter 22, the final chapter, provides a more in-depth thematic analysis of the interviews and addresses some of the main themes and issues that come up in several (if not all) of the chapters.
The Interviewee’s Career

I do not think it is controversial to say that becoming a psychometrician does not have the same traction as becoming a doctor, a fire(wo)man, or even a psychologist, and it is a profession one might only become familiar with at a later stage in life. So how do people end up in this career? What were the important reasons for our presidents to pursue a career in psychometrics? Each interview started in exactly the same way, namely, with the simple question, “How did you end up in psychometrics?” which served as an invitation for the presidents to share their memories of their own career. Why did they pursue psychometrics in the first place? What is their research agenda, and what aspects of their research are so intriguing to them? What do they consider their own most important contributions? Who (colleagues, mentors, or teachers) were the main motivators for the psychometricians to do their research, and why? Many presidents share stories of the people they met in their own lives who sparked their interest in the field or who were crucial in finding their way into psychometrics. Though some of our presidents have distinct memories of an event in their childhood or adolescence that sparked an interest in psychometrics, like Susan Embretson’s encounter with poorly constructed test items (Chap. 10), or Larry Hubert’s disappointment when a large testing program confirmed that his “plans on being a cabinetmaker were over” (Chap. 4, p. 38), pursuing a career in psychometrics is not often a childhood dream. As it turns out, for most of our presidents, ending up in psychometrics was just a matter of coincidence, a career that, almost by accident, crossed their path, but which then turned out to be fascinating after all.
The History of Psychometrics

Besides the many personal stories, Twenty Interviews With Psychometric Society Presidents: What’s on the Mind of the Psychometrician? also provides a wealth of historical facts and knowledge that is relevant for each practicing psychometrician. Not only have the interviewees themselves been significant scholars in psychometric research over the last few decades, and thus part of the recent history of psychometrics, they are also more than equipped to reflect on the earlier history of the field. What do they consider psychometrics’ masterpiece? How has psychometrics influenced the society we now live in? And especially, which founder of psychometrics has inspired them the most, and why? The interviewees often share their experience of a unique moment in the history of psychometrics, often including fellow psychometricians, some of whom are no longer around. Ivo Molenaar (who sadly passed away in February 2018) remembered the once-in-a-lifetime experience of organizing a course together with Melvin Novick at the University of Groningen which attracted many psychometricians from around the Netherlands (see Chap. 9), and in Chap. 8, Robert Mislevy considers the course jointly taught by the iconic psychometricians R. Darrell Bock and Benjamin D. Wright as one of the most formative experiences in his education. This book preserves the important stories of both the interviewees and the psychometricians they remember for future generations.
The Position of Psychometrics Among the Other Disciplines

It is not very controversial to say that psychometrics has a somewhat complicated position among the different related research disciplines, such as psychology and statistics. Early psychometrics had a very close relationship with psychology and was usually aimed at solving measurement problems in psychological research. But as you shall read in the following chapters, not all psychometricians consider themselves psychologists or members of the psychology community. Contemporary psychometrics is no longer only measurement oriented and has instead come to involve research methods and practices that have drifted over from other disciplines (such as statistics and machine learning). These methods often aim for other, nonmeasurement goals, such as data analysis, data visualization, or prediction. These interviews invite the interviewees to reflect on the character and identity of psychometrics and its relationships with other scientific disciplines. What constitutes “proper” psychometric research, and how “properly psychometric” do they consider their own research? What is the relationship between psychometrics and neighbouring fields, such as psychology, education, or statistics? Do they consider it problematic that psychometrics has drifted away from psychology, or does it perhaps give opportunities and open up avenues where psychometrics can also make contributions? Twenty Interviews With Psychometric Society Presidents: What’s on the Mind of the Psychometrician? thus shows how psychometricians perceive the position of their own field, and as it turns out, their opinions and visions vary strongly.

The interviews show that there are a number of distinctions or sources of disagreement. One distinction deals with the substantive versus the data-analytic approach. Klaas Sijtsma (Chap. 20) and Paul De Boeck (Chap. 17) think that psychometrics should have a closer connection to psychological or educational science and consider it problematic that psychometrics has developed into a standalone discipline. According to these presidents, psychometrics should be largely informed by substantive theory and should also contribute to substantive theory. Brian Junker (Chap. 18), on the other hand, considers psychometrics a type of statistics, a toolbox filled with all types of measurement models (among others), and thus essentially content-neutral (not substantive). Presidents like Brian Junker argue for a psychometrics that is easily transferable to other fields, where psychometric models can be used for a variety of research problems.

A second, related distinction is between presidents who are driven by theoretical or foundational issues (substantive psychological questions or proving mathematical theorems) and presidents who are driven by applied problems. Where the former may believe in the value of foundational research, the latter, for example, Wim van der Linden (Chap. 11) or Hua-Hua Chang (Chap. 21), are eager to build new measurement structures that have practical use outside of scientific research. These presidents want to have a practical impact on our society and, especially, on educational measurement.

The last distinction I will briefly describe here is between the narrow and the broad focus. Some presidents, like Jos ten Berge (Chap. 19), would be perfectly content if psychometrics were to stay within its original measurement bounds: it is first and foremost the task of psychometricians to improve psychological measurement, which is what they do best, and not to let themselves be distracted by other things. In contrast, proponents of a broader account of psychometrics encourage psychometrics to spread its wings, to see in what other domains it can make contributions, and to see which other domains do relevant work that psychometricians could learn from. Psychometricians thus vary strongly in terms of their ideals about and approach toward psychometric research and its relationship with its neighbours, and you will find representatives of all the different positions in the following chapters. An analysis of the differences in vision and approach among the psychometricians can be found in the final chapter (Chap. 22).
The Future of Psychometrics

Besides a testimony of the presidents’ own history in the field, the interviews also provided an excellent opportunity for the interviewees to reflect on the challenges that lie ahead. Presidents like Ulf Böckenholt (Chap. 16) and Hua-Hua Chang (Chap. 21) point out that the present and the future are exciting times for psychometrics: different types of behavioural data are becoming easily available, and due to advanced technology and advanced analysis methods, these data may be a fruitful source for the future of the behavioural sciences, including psychometrics. With this in mind, the future of psychometrics seems promising, especially to those with a broader conception of psychometrics, i.e., to the presidents who do not just associate psychometrics with the more traditional topics in psychological or educational measurement. Several presidents name exciting challenges with respect to these upcoming data sources and what psychometrics can contribute to their analysis, but Jacqueline Meulman (Chap. 14), for example, expresses her worries about these developments: we should not put too much trust in new trends like big data or machine learning since these techniques come with their own problems that we should be wary of, and the psychometrician’s expertise and strength simply lie elsewhere. Moreover, overfocusing on these new developments might take much-needed attention away from the unsolved problems that still exist in psychology and education. It becomes clear from these interviews that the psychometricians, though perhaps for different reasons, are not entirely confident about the future of their field. In these chapters, they share their worries and hopes for the future of psychometric research.
The Merit of This Book

Ultimately, this book provides 20 unique answers to the question “what drives the psychometrician?” This book shows both the diversity of psychometrics as a research area (the variety of topics that fall under psychometric research) and the diversity of the psychometricians themselves (the variety of their approaches, ideas, and visions about the field). The testimonies of these psychometricians are not only valuable because of the presidents’ thoughts and ideas about psychometrics, but they also invite the reader to reflect critically on what holds this seemingly fragmented field together and what challenges need to be solved in the future. This book is a relevant source for anyone who wants to improve their understanding of psychometrics and its wide variety of approaches, for members of the psychometric community who want to make a contribution to this research domain, and for those who want to take a stance with regard to historical, current, and future developments of psychometrics.
Chapter 2
James Ramsay
“It turned out that mathematics was so much fun that I got completely addicted to it.”

James Ramsay was president of the Psychometric Society in 1981. He earned his Ph.D. degree under Harold Gulliksen’s supervision in 1966 at Princeton University. He is currently emeritus professor of quantitative modelling at the Department of Psychology at McGill University. His expertise lies in multidimensional scaling and functional data analysis.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_2
How did you end up in psychometrics?

My intention was to be a teacher of English in secondary school. I grew up in a small town in the Canadian prairies, where there weren’t a lot of ways of knowing about different ways you could spend your life. I have always really loved literature and am an avid reader. So I thought it would be fine to be a teacher of English in high school; I still think that, sometimes! But I was good at mathematics of course when I was in high school, too good in many ways. I felt it was kind of a boring subject because I didn’t particularly see the motivation in cracking polynomials. So, intending to be a teacher of English, I entered the Faculty of Education at the University of Alberta with a major in English teaching, but I also elected to do a minor in mathematics. It turned out that mathematics was so much fun that I got completely addicted to it. I took a course in introductory statistics, but it was really a calculus course that nailed me and reshaped my life’s work. During the 4 years at the University of Alberta, I took all of the mathematics courses that I could squeeze into my program, and a course in mathematical statistics; but I also did mathematical logic and differential equations and a number of other things. So I graduated with something that might be seen as a weak BA in mathematics. Not enough to get me into graduate school in maths but substantial nonetheless.

But educational psychology was a really interesting subject, and so I ended up doing psychology as well. And because of the education faculty, there was a Ph.D. program in psychometrics at the University of Alberta, which was run by a wonderful guy, Steve Hunka, who really took me under his wing and made sure that I could take graduate courses while I was an undergraduate. Because of the psychometrics program, I knew about psychometrics as an undergraduate and Steve said, “You know, you should apply to psychometrics someplace. You probably won’t get in because of your limited background in mathematics and your background in education, but you should try and apply.” So, I selected three places: Princeton, University of Illinois, and Berkeley, applied, and got into all of them. That was mainly, I think, on the strength of my performance in the psychology GRE. I really studied psychology as a field just as an avocation, just for fun, and I loved it. I read Stevens’s Handbook of Experimental Psychology¹ from front to back, so I really had a lot of substantial background in psychology as well. I chose Princeton for a completely trivial reason; it was featured in a National Geographic article a few months before.

You started your graduate education at Princeton. Who was your advisor?

My supervisor was Harold Gulliksen. He was the granddaddy of test theory, and he had written the most famous book on classical test theory.² He was close to retirement, and he had probably seen his best days as a researcher. He was a very nice man, but his knowledge of contemporary statistics at that point was pretty limited. What he did was make sure I was in contact with all the best people that were in psychometrics.

¹ Stevens, S. S. (Ed.). (1951). Handbook of experimental psychology. New York: Wiley.
² Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
And what was your dissertation about? My dissertation was in multidimensional scaling. But my department said “you have to do an experiment, you can’t just do mathematics,” so I said “good, let me do color vision.” I repeated some of Torgerson’s work on color vision,3 but I used a statistical approach, because my thesis was really on a statistical model for multidimensional scaling. I was aiming at the kind of treatment Karl Jöreskog did for factor analysis. Although the thesis was about a statistical problem, it did have some pretty diagrams of colors. Did you enjoy doing the experimental part? I thought that was such fun, I really did. But of course, it was the statistics that really captured my imagination. What about statistics or mathematics captured your imagination? I think that I probably have the mind of an engineer, in a way. I love mechanisms, I had a huge Meccano set when I was young, and a lot of model airplanes. I had a particular enthusiasm for anything that had a kind of mechanistic character, and maths had that property for me. What the calculus instructor did for me was to see beyond that and realize that calculus was not simply a lot of rules about how to calculate derivatives and integrals but that it was really confronting some deep questions about infinity, and the infinitesimal, that were quite troublesome and took a long time for mathematicians to resolve. The derivative has been defined in rigorous terms in mathematics for at least a couple of centuries, but the integral is still a subject of active investigation as a concept. It’s much more challenging to see beyond the engineering aspect of mathematics into that world where people still argue about the role of topology in measurement theory. That was wonderful. That kind of got the philosophical side of myself. And it really is the side of myself that loves literature in fact. 
It’s interesting that you both love literature and also love the mathematics, but I often hear that mathematics has a philosophical side to it.
Indeed it does, and a romantic side.
A romantic side even! How did you continue after your Ph.D.?
After my Ph.D., there were a number of places I could go. The first place I needed to go to was back to Edmonton to marry my wife, which is now 50 years ago, and we’re still happily together. Very happily together! I went back to take care of that little deed and an opportunity had come up. I was told there was a 1-year temporary appointment at University College London. London had a huge cachet in 1966 when I graduated. There were the Beatles, Carnegie Street, and it was just a place where things were happening. The city itself was a total mess, still bombed from the war, but the romance of London was just unspeakable, fantastic, so we said well, let’s go to London!
3 Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
About halfway through my time in London, or what I thought was my time in London, I saw that McGill had an opening. Somebody I knew was leaving and this position opened up. The chair of the department contacted me and asked me if I was interested in a position, knowing of course that I came from Alberta and that I already knew something about McGill, so I said that I really was. McGill had the most famous psychology department in Canada because Donald Hebb was there. George Ferguson had done some work on factor rotation, and he said “well, I’m not sure we have the money to fly you over for an interview” and I said, that’s alright, I’ll go anyway. I went to McGill, I would not say unseen, but without the usual interview process, which would never happen these days. I think about a week or two after that, I mentioned to the chair of the department at UCL that I’d taken this position at McGill. He was suddenly shocked; he told me that he assumed I was going to stay there!
They wanted to keep you.
But nobody had ever said anything like that! The life of anybody who goes into a psychology department with a quantitative background could be hell, because of all the people that want help with statistics. Also, in those days computing was a very challenging thing for most people, so I would spend all week with lineups outside my door. George Ferguson had already been through that, so what he did was to hire a masters-level statistics person to do the department statistical consulting, to take all that off my shoulders, so that I could do research like everybody else. And that made a huge difference in my career.
You preferred the research rather than the consultancy.
Oh yes! I love teaching, but research is of course what you live for, it’s why you do a Ph.D.
What were you working on at the time?
Extending multidimensional scaling in various directions.
I worked on multidimensional scaling from when I joined the department in 1967 to roughly the mid-80s, and I got into a lot of other things, too. I’ve always been a little bit diverse in terms of what I’m interested in, so I got into Bayesian statistics, for example, and even into numerical analysis. I had taken a great course in differential equations, and I became very interested in techniques for approximating solutions for differential equations, especially stiff differential equations. I worked on that for a year, but just about the time I was ready to write a paper on it, C. William Gear from University of Illinois wrote the defining paper on the problem!4 That’s the fortunes of war in research; if you’re working on a really interesting problem, chances are other people are also working on it. Subsequently in my life, that work paid off in many ways, so I never felt regretful about it.
4 Gear, C. W. (1969). The Automatic Integration of Stiff Ordinary Differential Equations. In A. J. H. Morrell (Ed.), Information Processing (pp. 187–193). New York: North-Holland Publishing Co.
When I Google you, I stumble upon a lot of functional data analysis.
Exactly.
Can you explain a little about functional data analysis?
Sure, yes. In the department of psychology, my job was to teach multivariate statistical methods to graduate students, which remained my bread and butter course through my entire time there. Due to my thesis work, I’d already had a pretty good exposure to multivariate statistics, and so I continued to do some research on various topics, like factor analysis and so on, and published in Psychometrika. I went to France for a sabbatical in 1980 and spent a year there. I had no inkling of what was exactly going on in France, but there were two remarkable things happening. The first was all the wonderful work in France on spline functions. The French really took splines to a new level.
What are spline functions?
Okay, spline functions are basis functions; they’re a set of functions that can be used to approximate practically everything. They’re enormously flexible, and they proved to be absolutely crucial to functional data analysis.
Lisa, you remind me of course that your background is not in psychometrics.
I know the basics, but this is certainly beyond my knowledge.
Right. The French statistical community was really under the thumb of the pure mathematicians. French mathematics had become in many ways insanely abstract, and they were using very abstract formulations of multivariate statistical problems. My first reaction at the time was that this was just window-dressing, making something look fancy that doesn’t need that kind of treatment. But, the more I thought about it, the more I thought, this kind of Hilbert space stuff that they were using could really be used also when the data are functions.
In my various researches up to that point, I had already begun to think about how to bring functions as parameters into statistical analysis, being much inspired in fact by work in Bell Labs by Doug Carroll and Joe Kruskal, people have mentioned them for sure. That insight, coupled with the fact that there was a small group of very good people already working on that idea (so it was definitely not my invention), was great. And by the end of the year, I got pretty good at French and I could read anything in French. I much preferred reading mathematics in French than in English in fact, because they write very beautifully. I got involved with one of the people that was doing the work I admired the most, Philippe Besse in Toulouse. Philippe and I got along very well, and I said, “Philippe, come over to spend a couple months in Montreal, and I’ll pay your way in and put you up somewhere and we’ll work together on this problem.” So we did, and that was the beginnings of my work in functional data analysis. At the same time, my perception of multidimensional scaling was that it was, in my view, a very interesting technique and it had a lot of applications but that there seemed to be not much interest in psychology in using it.
I hear that a lot.
If you talk about the relationship between psychology and psychometrics, I would say it’s both distant and uneasy. The psychologist needs psychometricians badly for elementary statistical consulting. But once they have gotten what they need, they don’t want to hear anything else, so, statistically speaking, psychology is a very conservative community, even in a very good department like mine. I felt there wasn’t any point in elaborating a method for which there wasn’t much of a user community, and it didn’t seem at that time likely there was going to be such a community, so I kind of said, “I want to kind of get out of this field, it is already over-researched, relative to the number of people who want to use it.” That’s an interesting question too, about all kinds of fields. There are always parts that nobody thinks about and they should think about, but there are parts that far too many people are working on.
And multidimensional scaling was…
At the time, MDS was certainly in that category, in my view as well as some of the other people who came to similar conclusions about the same field, so that’s a good example of an overcrowded field. And probably test theory is overcrowded as a field as well, and that has always been the case I would say, at least at the statistical end of it. Functional data analysis seemed to me to be a very exciting thing to get into, and I could see that this is just what we needed in all kinds of areas.
Even within my own department, I immediately saw several important applications which I figured I’d end up writing as well.
And how different is it from other psychometric techniques?
Well, it’s very different from multivariate techniques because functions are of course essentially infinitely large vectors if you like. But the key thing you have in functions that you don’t have in discrete vector space is the possibility of taking derivatives, integrals, all the other things you can do with functions. Basically, all that is defined by the concept of smoothness in a function. So functional data in a sense has a much wider range; it’s a much richer area to work in, in terms of the number of modeling strategies that you have, let’s put it that way. So already by 1985, I had pretty much signed off on multidimensional scaling and that area of work, and my presidential address at the Psychometric Society was in fact on functional data analysis.
Then in 1991, I went on to publish a paper in the Journal of the Royal Statistical Society,5 a discussion paper. It’s a huge thing when you publish a discussion paper in JRSSB, so that instantly really gave the field a lot of visibility. Bernard Silverman had already been kind of thinking along the same lines, so at a conference in 1991 before this paper came out, he said that we had to write a book together. And he had already written two books, so he was a very well-known guy in statistics. So Silverman and I started to work on this book and had it finished by 1997.6 We followed it up by a book about applications, which was an easier introduction for a lot of people.7 That was certainly the take-off point.
Did functional data analysis have a bigger crowd than MDS?
Oh my gosh, yes! But now MDS has come back, and there are people in other fields like market research, chemometrics, and much more recently machine learning as well who are very interested in MDS. Strategies for estimating things like manifolds and curved spaces have become quite interesting again. And it’s also widely used in astronomy, believe it or not.
I believe you, but I don’t know astronomy!
It’s not that MDS has disappeared, but it never came back to the landscape in the social sciences I would say, except for market research. But functional data analysis became kind of crazy, I can’t begin to keep track of it now.
Are people in psychology using it as well?
Yes.
So what kind of research do they do? What kind of data do they analyze with functional data analysis?
Data which are distributed over a continuum like time or space or wavelength, geographical space; any type of data where there’s kind of an underlying subscale that’s continuous. Particularly now, functional data analysis is extremely important in climate research.
In climate research, you’re dealing with data in the atmosphere and the oceans, it’s distributed over space and time in huge amounts, so in that field it has really become very important.
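Editorial illustration: the two ideas mentioned in this part of the interview, that a curve can be written as a weighted sum of spline basis functions and that such a curve can be differentiated, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not Ramsay's software: the function names (`basis`, `basis_deriv`, `curve`), the cubic degree, and the knot positions are all hypothetical choices made for the example. It uses the standard Cox-de Boor recursion for B-splines.

```python
# Minimal sketch: a curve as a weighted sum of B-spline basis functions,
# together with its first derivative. All names and values are illustrative.

def basis(i, p, t, x):
    """Value at x of the i-th B-spline of degree p on knot vector t
    (Cox-de Boor recursion; terms with a zero-width knot span count as 0)."""
    if p == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left_den = t[i + p] - t[i]
    right_den = t[i + p + 1] - t[i + 1]
    left = (x - t[i]) / left_den * basis(i, p - 1, t, x) if left_den > 0 else 0.0
    right = (t[i + p + 1] - x) / right_den * basis(i + 1, p - 1, t, x) if right_den > 0 else 0.0
    return left + right


def basis_deriv(i, p, t, x):
    """First derivative at x of the i-th B-spline of degree p >= 1."""
    left_den = t[i + p] - t[i]
    right_den = t[i + p + 1] - t[i + 1]
    left = p / left_den * basis(i, p - 1, t, x) if left_den > 0 else 0.0
    right = p / right_den * basis(i + 1, p - 1, t, x) if right_den > 0 else 0.0
    return left - right


def curve(coefs, p, t, x, deriv=False):
    """Evaluate sum_i coefs[i] * B_i(x), or its derivative if deriv=True."""
    f = basis_deriv if deriv else basis
    return sum(c * f(i, p, t, x) for i, c in enumerate(coefs))


# Assumed setup: cubic splines on [0, 1) with three interior knots.
DEGREE = 3
KNOTS = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
N_BASIS = len(KNOTS) - DEGREE - 1  # number of basis functions

# Greville abscissae: using them as coefficients reproduces the line f(x) = x.
GREVILLE = [sum(KNOTS[i + 1:i + DEGREE + 1]) / DEGREE for i in range(N_BASIS)]

if __name__ == "__main__":
    for x in (0.1, 0.5, 0.9):
        print(x, curve(GREVILLE, DEGREE, KNOTS, x),
              curve(GREVILLE, DEGREE, KNOTS, x, deriv=True))
```

In a real analysis the coefficients would be estimated from noisy observations (for example by penalized least squares) rather than set by hand; the sketch only shows why a basis representation makes derivatives directly available.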
5 Ramsay, J. O. and Dalzell, C. (1991). Some tools for functional data analysis (with discussion). Journal of the Royal Statistical Society, Series B, 53, 539–572.
6 Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. New York: Springer.
7 Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis. New York: Springer.
Have you personally worked with psychologists?
I always have. I collaborated with a lot of great psychologists. For example, it turned out that functional data analysis was just what people needed to analyze how the brain controls human movement, how muscles are activated by neural impulses and produce small motions as a result. I’ve done a number of papers in the area of motor control.8 And I have a colleague, David Ostry, in my own department who is a very well-known psychologist. So immediately I had those applications. Darrell Bock and I also worked on human growth.9 I’ve also done a lot of work on weather data because the data is easily available and everybody knows about weather so it’s a natural introduction to the area.
At IMPS or in psychometrics, I think the emphasis lies on test theory. Is functional data analysis a smaller field within psychometrics?
It depends on what you mean by functional data analysis. Now my understanding of functional data analysis is probably broader than a lot of people have, because I see as a functional data analysis problem any problem where the model is itself a function. In psychometrics, that’s exactly the situation: you have the item response function that is used to determine whether a person passes or fails an item. Of course you also estimate that person’s ability and a major central task in test theory is to use data from a test administration to both estimate the abilities of the examinees and also the item response functions. My approach to test theory would be a bit divergent from most people’s. I haven’t been particularly comfortable with the idea of just a single low dimensional family of models, the famous 2PL, 3PL models, and so on. I’ve seen enough test data to know they don’t fit the data very well, that we can do much better if we had more flexible methods.
And I also began to understand they aren’t actually the right models: the right model contains the log odds function of probability, because that’s a much more natural functional object in a model, and from that you can go back to the probability curve. So my psychometric work is bang in the middle of functional data analysis. I see the psychometric community as having inherited a lot of conservatism from psychology. Many psychometricians are in psychology and education and I would say that, looking at psychometrics as a field, it has not taken full advantage of the great developments in statistical theory, statistical practice and computational technology that it could have. It hasn’t tried to do that. In my time as president, and in my time over the years on the board of trustees and the editorial council, I tried to push the Society into getting into contact with the statistical community. In a way, this conference has done a fairly reasonable job of doing that, but I must say, historically, it’s a little bit like its close cousin, econometrics, that tended to not interface with the main statistical community as much as they could have.
8 Ramsay, J. O., Gribble, P., & Kurtek, S. (2014). Description and processing of functional data arising from juggling trajectories. Electronic Journal of Statistics, 8, 1811–1816.
9 Ramsay, J. O., Altman, N. & Bock, R. D. (1994). Variation in height acceleration in the Fels growth data. Canadian Journal of Statistics, 22, 89–102.
Do you think that’s maybe changing or would you say that we’re still stuck?
I wouldn’t want to answer that, frankly, because we are at my first meeting in 11 years, and so lots has happened in between. I think it depends on whom you talk to and what they’re working on. Some areas have evolved a lot, other areas I’m not so sure.
You mentioned that the relationship between psychology and psychometrics is a little fuzzy. What do you think is the reason for that fuzziness?
Well, I think the reason is that psychology still draws its student population from a population of people who are more oriented toward the human aspects of psychology, who don’t see the scientific aspects of psychological research as being really central to why they want to get into psychology. This is especially the case in North America. The students that I worked with learned very quickly because they really have to know a certain amount of statistics to get papers published, but by and large, their attitude toward the mathematical sciences is that they would learn no more than they absolutely have to learn, with some graphic exceptions of course. I’ve actually surveyed how, in journals, for example, psychologists use and analyze data. A long time ago, Lee Cronbach estimated that 95% of journal papers in psychology only use the analysis of variance as a data analysis tool. In my view, that percentage still remains somewhere around 90-plus percent. I’ve surveyed journals in fields like personality and social psychology, where you’d expect them to use multivariate methods, and there again I saw that they often massage the data so that it looks like an analysis of variance problem. And that’s what they use. So I would say that even the statistical technology that I taught has had limited impact on psychology.
I still love psychology, I have a very happy relationship with my department, and our department is one of the best in the university, but when I look at their research, I have to say that in a lot of ways, they need to seriously rethink some of their bad habits.
So how are we going to change that? What would be the first step?
In a sense, one of the groups that could work for improving the quality of data analysis in education and psychology is the Psychometric Society, just as the American Statistical Association does more generally. Instead they seem to have decided, “Let’s just be a little community of people who study test theory, multidimensional scaling and so on, and forget about the people we work with on a day to day basis.” They could have had a much more activist position with respect to critiquing the statistical practices in psychology than they did.
So you think the relationship between psychology and psychometrics should be one of…
…mutual support. We need them, we need their data, we’re very interested in their problems, but they also need us. For example, one of the worst practices in psychology in terms of research practices is the use of first-year undergraduate students to provide data in return for course credit. Well, that’s insane, I mean, Lisa, how many hours do I have to explain why that’s such a bad practice!
Try! One minute.
Well, start with the fact that you’ve got a subject that already knows a lot about psychology, and can easily manipulate the instructor by giving the data the subject thinks he or she wants. Moreover, if you’re making participation in an experiment a course requirement, you’re effectively working with slave labor. And then you add the fact that it’s a subject population, first-year and second-year university students, who are very unusual people. They are nowhere near the population at large you claim to be studying, in any respect. In age, in income, in health, in intelligence, family background, SES, you name it. These people are way out on the fringes of society. The fact that the community would be satisfied to do research aimed at revealing something about human beings based on such a weird population is just not scientific. And that’s only a start.
Fair enough! So what role does psychometrics have in society?
If you were to ask what mathematics has contributed to social well-being of populations, probably testing is, believe it or not, the biggest single contribution of mathematics to society. Maybe some people would argue that its role in financial systems might be a rival, and they might be right, I’m not sure, and then there is certainly the gift of the nuclear bomb. Like I said, perhaps 100 million Americans have been tested per year.
And those data have been analyzed, not necessarily well, but they’ve been analyzed using methods which were still evolving in the psychometric community in the twenty-first century. So, it’s huge.
And what is psychometrics’ biggest scientific achievement?
I think especially in its early days, Psychometrika published some really pivotal papers that went on to have a great impact on statistics in general. They really opened up new angles. Multidimensional scaling, factor analysis, and structural equation modeling would be examples. The first paper to use the term “functional data analysis” also appeared in Psychometrika, in my presidential address! The psychometric community has had a history of really spinning off interesting takes on interesting problems and doing interesting things. Its impact on the rest of the mathematical sciences has been substantial.
Are you best known for your work on the functional data analysis?
Yes, definitely, and perhaps also for parameter estimation for models defined by differential equations.
Is that also what you’ll be remembered for in later years?
Who wants to predict that?!
What was most influential?
At this point, functional data analysis is certainly far more visible now in the statistical community than in the psychometric community, and that’s one of the reasons it’s been 11 years since I’ve been to this conference. IMPS meetings seem always to conflict with ones where I’ve been an invited keynote speaker or something like that.
Is it a good experience to be back?
It is, a wonderful experience! People, I must say, have been very nice to me. So I’m glad I’m back, even at this advanced age.
Who has really inspired your work?
John Tukey.
Why?
He was a Princeton faculty member. He hardly ever was at Princeton, since he was busy running around doing military research and a whole bunch of other things, but he had a vision about statistics and how it would relate to mathematics, which I think was a big advance over what prevailed at that time and still perhaps prevails in the statistical community. He was already a very famous mathematical statistician, but he wrote a paper in 1962 called The Future of Data Analysis,10 in which he laid out his perspective that statistics had become overly mathematical, that it had been spending far too much time solving problems which were essentially invented, and that had no relationship to what statistics really existed to do, which was to use mathematics to reveal interesting insights from data. It still is, in a lot of ways, a real renegade paper that’s really inspired me. And of course when I was at Princeton, I saw Tukey in operation. Once I gave a talk over in the mathematics department and he was in the audience.
10 Tukey, J. W. (1962). The future of data analysis. The Annals of Mathematical Statistics, 33, 1–67.
Did you talk to him?
I never got to know him well; and I’ve never had a face-to-face talk with him. But I have seen him many times of course, and I had good friends in the graduate college at Princeton who were his students. So I knew a lot from them. And that was exciting.
Would you say that his work also is the most important psychometric work of all time?
Well, his impact on psychometrics itself wasn’t as huge. He did some interesting things for sure, but he acted more as a kind of inspiration and a consultant for ETS and other psychometric operations, but he didn’t publish much in psychometric literature.
So what is the most important psychometric article or book?
I don’t think I could say. I don’t think I’m qualified to say that anymore. I think since we’re trying to be as honest as possible, I really couldn’t answer that question. I’m eager to learn what other people think about that.
Fair enough. We already talked a bit about what psychometrics has got to do in the next coming years, but what do you think is its biggest challenge for the future?
I actually talked about that in this conference. Testing is still the core of psychometrics, and it’s still the area where its impact on society is the greatest. I think that it needs to reformat test theory into one that is readily intelligible to people out there who take tests. That means it should represent ability on a closed interval, let’s say running from 0 to 100, or 0 to number of items, or something like that, that is familiar and natural for people. I think this is a big priority. And the fact that it didn’t do that meant that we never convinced people who analyzed test data, including the people who run ETS and ACT and other big testing organizations, that it would be beneficial to not just count the number of items that were correctly answered. Sum scoring, as I call it, is still almost the universal method for estimating ability, and we can do much better.
Now it definitely does well enough to have justified its place in society, most people are very happy to have their scores recorded like that, and they readily understand why that would be a useful thing. And it is a useful thing. But in most types of methodology aimed at improving the lot of humans, improvements are very incremental, and progress doesn’t leap forward by 50% or 100% or even 20%. Medicine has made huge improvements in all kinds of directions. For example, the life expectancy of someone with breast cancer is now substantially greater, but this improvement has been gained 1% at a time. We could overnight produce an improvement of the order of 50%, at only the cost of developing suitable software, if people were to switch to efficient scoring of tests. That would be a big leap forward.
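Editorial illustration: the contrast between sum scoring and model-based scoring can be made concrete with a small Python sketch. It is a minimal example under loud assumptions, not Ramsay's method: it uses the two-parameter logistic (2PL) model only because it is compact, although Ramsay argues above for more flexible item response functions, and the item parameters, response patterns, and function names (`p_correct`, `log_odds`, `ml_ability`) are all hypothetical. Under the 2PL model the log odds of a correct answer are linear in ability, and two examinees with the same sum score can receive different maximum-likelihood ability estimates.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer for
    ability theta, item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_odds(theta, a, b):
    """Log odds of a correct answer; linear in theta under the 2PL model."""
    return a * (theta - b)


def ml_ability(responses, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Maximum-likelihood ability estimate by bisection on the score equation
    sum_i a_i * (u_i - P_i(theta)) = 0, which is decreasing in theta.
    For all-correct or all-wrong patterns the estimate is unbounded and the
    search simply returns (approximately) the boundary hi or lo."""
    def score(theta):
        return sum(a * (u - p_correct(theta, a, b))
                   for u, (a, b) in zip(responses, items))
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical three-item test: (discrimination, difficulty) pairs.
ITEMS = [(0.5, -1.0), (1.0, 0.0), (2.0, 1.0)]

# Two hypothetical examinees with the same sum score of 2.
PATTERN_A = [1, 1, 0]  # misses the hard, highly discriminating item
PATTERN_B = [0, 1, 1]  # misses the easy, weakly discriminating item

if __name__ == "__main__":
    for name, u in (("A", PATTERN_A), ("B", PATTERN_B)):
        print(name, "sum score:", sum(u), "ML ability:", ml_ability(u, ITEMS))
```

The sum score treats the two patterns as identical, whereas the model-based estimate weights each item by how informative it is. How large the practical gain is depends on the test and the model, so the sketch illustrates the mechanism only and makes no claim about the size of the improvement.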
The way I put it is, if you were working with the stock market and you could produce a 5% improvement in return on an investment, you would instantly adopt the method. If you do that in medicine, you would get a Nobel Prize. That’s the nature of things; the importance of an improvement should be measured by the number of people it affects. And even a 1% improvement in the way we handle testing data, aggregated over the millions of people who are tested, is a huge benefit to society. There would be a lot of people who deserve to get into university who would get in by that benefit, and others who don’t deserve to get there would not get in.
A major change.
A major change, yes. And we have it at our fingertips; all we have to do is reformat test theory in a way that works alongside the sum score, so you can do direct comparisons. And we could have done that 30 or 40 years ago.
So why is psychometrics slow to change?
That’s a great question, but you could also ask that question about science in general. It took people a long time to adopt the idea of continental drift, for example, it took a huge amount of time for people to recognize that smoking was toxic, that the climate is warming, and the list goes on. And the scientific community has to accept responsibility for that. At this point in time, why are opioid drugs being so heavily abused, both by the medical community as well as the people who use these drugs? I mean, what is there to understand? How long is it before we realize the hazards of putting radioactive substances in the ground or working with nuclear reactors that are doomed sooner or later to fail, whether it’s due to a tsunami or an earthquake, or a terrorist event?
It takes time.
It takes time, and sometimes it takes a lot of catastrophes. Unfortunately, in testing the catastrophes are measured one single person at a time, so they’re kind of invisible.
So do you have a role to play in the coming years? Do you still have plans?
I do!
I think I’d like to come back to the Society and kind of say “Come on folks, we should do better and why don’t we do better.” Why don’t we take responsibility for the beneficial aspects of our work, to make sure that it’s delivered and that it involves not only things like reformatting test theory but also taking advantage of modern computing and Internet technology on a much larger scale than we do.
With that I think we’ve come to the end of this interview.
It was delightful.
Thank you very much. Is there anything you’d like to add?
I think I already pointed you in a couple of directions: what are the black holes and more valuable areas in psychometrics. And also: which areas have become fossilized in the sense of more people going in doing them than we really need. To some extent we should be proactive in diverting the enthusiasm of young people into directions where there is more opportunity to do something new.
So psychometrics should also become broader than just test theory.
It should always be aware that it’s only a corner of the larger enterprise of data analysis, and recognize that there are other aspects of science that have essentially the same problems we do. We need to share our expertise.
Chapter 3
Peter Bentler
“Psychometrics should be considered more broadly than just measurement in my opinion.”
Peter Bentler is professor of psychology and statistics at UCLA. He earned his Ph.D. in 1964 at Stanford University in clinical psychology. Peter Bentler was president of the Psychometric Society in 1982. Peter Bentler’s research interests lie both in psychology and psychometrics, among which are topics like personality, drug abuse, and methods and applications of causal modeling.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_3
Thank you, Peter, for participating in this project on the history of psychometrics. I’ll be asking questions about three themes: your own career, the relationship between psychology and psychometrics and other fields, and your view on the history and future of psychometrics. I always start with the question: how did you end up in psychometrics?
To answer that I have to explain how I even ended up in psychology, because I didn’t start out that way. I was an undergraduate anthrosociology major, and I got my Bachelor’s degree (we have a particular system here, Bachelor’s, Master’s, Ph.D.) and after I got my BA in that field, I got a job. And that job turned out to be in a small groups laboratory that a local military defense-related company was getting ready, and there, for the first time, I met Ph.D. psychologists. I had not attended any psychology courses, no quantitative or statistical courses. I worked there for about a year, and, as I said, I met various people who inspired me to go to graduate school in psychology. Of course, psychology in that time for me meant social psychology, because that was closest to sociology. So I went to Pittsburgh for a couple of years, studying social psychology, and then I figured that I might not be able to make a living in this. I ought to go on and do something where I can get a job afterward and feed myself. I didn’t think social psychology would do it, so I went into clinical psychology. I got my Ph.D. at Stanford. At that time at Stanford, there were a couple of new developments in psychology in the air. One of them was behaviorism as applied to real people, not in the sense of the Skinner boxes; Al Bandura’s work, for example. Another was mathematical psychology and quantitative methods in the sense of mathematical psychology, so there were Atkinson and Pat Suppes; Estes came later. Once I started learning clinical psychology, I already started to get nervous about where things were going.
I knew things were going in the behavioral direction, and it seemed like the whole field was becoming more quantitatively oriented, so I decided I had to learn more quantitative methods. I would say I left there with an interest in quantitative methods, but not really quite knowing what to do with them. I got a post-doc at the Educational Testing Service, but not in their quantitative statistics group, not with Fred Lord, but with Sam Messick, who was in personality assessment, though he moved to educational assessment. And so, I learned a bit more about the kind of things Fred Lord was doing at ETS at that time. Karl Jöreskog also came by to give a seminar. So, that just continued to pique my interest in quantitative methods. And then I went on the job market and got this job here; I have been at UCLA ever since and that has been a long time now, 52 years. I started out here in a position that was a joint position in clinical psychology and personality psychology, not in quantitative methods. But as I said, I had become interested in them, and I had been doing some self-study at that time and became interested in assessment issues. Let me back up a minute, my advisor had been a visitor at Stanford…
3 Peter Bentler
Who was your advisor?

He was Douglas Jackson, who was in personality assessment. I also had courses with Jerry Wiggins who was big on assessment methods and personality; he has a book on personality and prediction.1 I felt clinical psychology was a little too wishy-washy about the quantitative aspects of assessment so I just started learning more and more about it. I started having some success; my first real psychometric publication, in Psychometrika,2 came 3 years after having left ETS. I just kept hacking away at it, enjoying it. If you had asked me why I kept going at this career, it was because I kept enjoying it. I considered it a challenge, an intellectual challenge, to get my competencies up to the level that they needed to be, to try to solve problems that weren’t quite solved. And so after 10 years at UCLA, I left the clinical area totally and started concentrating more and more on quantitative methods and I have been doing that ever since. Somewhere in there, you have a question about whether I consider myself a psychologist, and I would say that, yes, I do. I really went through psychology but it’s much more than that. I’ve also had students who were interested in assessment related topics in psychology, in child development. I also had an interest in sexual behavior at that time, and there were measurement issues there as well. I also got involved with quantitative statistical methods in adolescent drug abuse research. I actually have a rather long history of doing work on adolescent drug abuse and research on young adult consequences of early adolescent drug use. I usually worked on this with colleagues who took the primary lead.3 They took the lead and I went along because it allowed me to do some quantitative things. I actually got financial support from the National Institute on Drug Abuse.
These were bigger grants that had a lot of drug abuse aspects but that also let me have a little quantitative aspect, which is what I became more and more interested in and concentrated on more and more.

You’ve always collaborated with applied psychology researchers?

To be honest, it’s a twofold thing. One, that’s where the money is, and it helped me to get support to, for example, pay for graduate students. In the setup that was here at the time, there was very little departmental support for graduate students. If you wanted to support grad students, you had to have a grant in order to put them on there. Two, I found that by working in those areas, I found challenging issues. For example, when Karl Jöreskog4 introduced structural equation modeling along with

1 Wiggins, J. S. (1973). Personality and Prediction: Principles of Personality Assessment. Reading, Massachusetts: Addison-Wesley.
2 Bentler, P. M. (1968). Alpha-maximized factor analysis (Alphamax): Its relation to alpha and canonical factor analysis. Psychometrika, 33, 335–345.
3 Newcomb, M. D. & Bentler, P. M. (1988). Consequences of adolescent drug use: Impact on the lives of young adults. Beverly Hills: Sage Publications.
4 Jöreskog, K. G. (1977). Structural equation models in the social sciences: Specification, estimation and testing. Applications of Statistics, 265–287.
Keesling5 and Dave Wiley6 (both of them have been kind of forgotten historically), all of their methods were normal theory based. But when you start in drug abuse research, and you do surveys about adolescent drug use, for example, on hard drug use among adolescents, you find it’s not a normally distributed trait. Most people don’t do any hard drugs, and then there are a few out there who do a lot of drugs. From my point of view, to feel right about it, you want to use methods you think can answer the questions correctly. So a huge challenge across years and years has been doing appropriate statistical inference in quantitative methods when the standard assumptions are not met. I’m still doing that. I have a student finishing tomorrow, with a dissertation on that topic, I have another student finishing in 2 weeks on that very topic, so these are ongoing interests, stemming from particular kinds of real data problems.

I don’t think many psychometricians started out as an anthropology or sociology student; are those still interests of yours?

Not specifically, no. I have an interest in people and things that are a little different from what I encounter in my daily life. In that sense, yes. Although in the last 20 years or so that interest has been focused more on going to exotic places and seeing species other than our own: wild animals, polar bears, tigers, lions, things along that line.

That’s a hobby?

Yes, traveling.

What about psychology appealed to you so much that you made that switch from anthropology to psychology?

At that first job, where I met the psychologists, I started understanding what they did, and how it related to the kind of things I had learned in sociology. It seemed interesting, and it was challenging. Starting off as a grad student with essentially no psychology background, I had to learn everything, what proprioception means, for instance. There are just thousands of words that psychologists throw out that I hadn’t been exposed to.
Is most of your research applied to psychology, or is there also part of it that is just psychometric?

Well, I mentioned my own substantive research in psychology which has been on drug abuse, but that’s because I had colleagues who were interested in that and I was interested to the extent they were, because it allowed me to focus in later years on the

5 Keesling, J. W. (1972). Maximum likelihood approaches to causal analysis. Ph.D. thesis, University of Chicago.
6 Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured variables. In A. S. Goldberger & O. D. Duncan (Eds.), Structural equation models in the social sciences (pp. 69–83). New York: Seminar.
quantitative issues that were raised by that kind of data. But even in the everyday practice of structural equation modeling, both theoretical and applied, people have real problems that they want to apply these methods to, so I encounter a lot of real data in all kinds of fields, but more through my teaching than through my own personal work.

What are the three most dominant topics in your work?

One of them has been what I would call “psychometric theory,” classical test theory. And my first paper after graduate studies was a variation on classical test theory, on error measurements.7 My first publication in Psychometrika was related to that and I have worked on those issues on and off, but not very frequently across the years. But then, just last year, I had another publication on exactly that topic, reliability theory, and specifically, the relationship of internal consistency reliability to actual measures of reliability and showing how they’re different and what the issues are there.8 So that has been a continuing interest of mine. The other interest was structural equation modeling and how to think about it, how to think about those kinds of models, and the statistical aspects of those models. I don’t know about a third… Well, a third would be something that I puzzled about a long time ago and wrote a paper about in the early 1970s9 and then gave up on. I came back to it about 10 years ago and gave some talks about it, but I still haven’t formally written it up. It is the idea of Guttman scaling and how to approach that quantitatively, but not from an IRT perspective. I have an alternative approach to that which I call an absolute simplex, which has a statistical development and a quantitative development. That’s something I still need to write down and work out and I’ve been kind of putting it off because I’m not satisfied with my full understanding of that.
Your psychometric work, was that mostly done by yourself, whereas the work on drug abuse was more in collaboration with others?

Well, my goodness, no, I owe most of my success to my students and my collaborators. I have been fortunate to find students and collaborators from around the world who have taught me things and then have been able to work with me on topics and produce those. So most of my papers, except the one I just mentioned about internal consistency last year, are multi-authored, with students or collaborators. Sometimes, I’m the first author; sometimes I’m the second or even third author. So it depends on the particular paper. I’m in this field because I love it, and I’m extremely fortunate to have met wonderful people who also love certain aspects of it and that I can have
7 Bentler, P. M. (1964). Generalized classical test theory error variance. American Psychologist, 19, 548.
8 Bentler, P. M. (2017). Specificity-enhanced reliability coefficients. Psychological Methods, 22, 527–540.
9 Bentler, P. M. (1971). An implicit metric for ordinal scales: Implications for assessment of cognitive growth. In D. R. Green, M. P. Ford, & G. B. Flamer (Eds.), Measurement and Piaget (pp. 34–63). New York: McGraw Hill.
a lot of fun exploring things with and find new topics that are challenging. I probably had 20 or 30 collaborators across the years.

There are still challenges in psychology, problems that have no solutions yet; is that also what appeals to you so much, that there’s still work to be done?

I’ve considered statistical inference to be a big problem over the last 20 or 30 years; I’m continuing to work on those things that I have been interested in and trying to make little improvements here and there. I don’t know if that’s the right modus operandi, I can’t say that; many of my collaborators come from mathematics or from statistics or from some other field, some have come through psychology.

Do you think that psychometrics can learn from those fields? Do you have the feeling that people from other fields can bring something to psychometrics that we don’t have?

Yes, I think there’s no question that good useful theoretically interesting technical things that have been developed in all kinds of fields are relevant to the kinds of things that are done in psychometrics. That’s because, in my view, methodological quantitative challenges in the social and behavioral and educational sciences are similar. Of necessity, there has to be an interchange between these areas; perhaps there could be more than there is, but it has been important and will continue to be important in my opinion. In recent times, statistics is becoming more important than it has been in the past and probably computer science as well.

As you said earlier, you consider yourself a psychologist, more so than a psychometrician? Or both equally?

I also probably consider myself an applied statistician, and in fact I have an appointment in the statistics department here also, since that department was founded, which was in 1998. I would consider myself an applied statistician also.

Being part of both, what do you think should be the relation between psychology and psychometrics?
That’s a hard question to answer, because partly it depends on what you define psychometrics or psychology to be. For example, the American Psychological Association has started supporting the idea of quantitative psychology, and you mentioned you’re from a methods department, right? Is psychometrics necessarily doing all kinds of quantitative psychology? No, probably not. If you interpret psychometrics to mean more than measurement focused things, of which factor analysis and classical test theory and item response theory are very good examples, there are all kinds of other quantitative and technical issues that need solving for psychology to advance as a science that are not strictly psychometrics. The oldest known
distinction is the old Cronbach10 distinction between the two disciplines of scientific psychology. Well, I happen to be interested in the differential individual differences type of psychology and not the other one, but the other one is hugely important. Are the differential psychology related methods the best methods for those other ones? Probably not, but they’re probably relevant, and there are certainly areas of overlap.

Should the Psychometric Society become a broader institution for that matter?

It probably is already. And I think the journal is reflecting that, as time goes on, the kinds of articles that appear are broader. UCLA, the University of California, is very bureaucratized. Every so often there are personnel reviews and you have to put your materials in and there’s a committee that evaluates you. They write a report about your work and that leads to a recommendation for a merit increase or not. Anyway, I remember one of these in the statistics department; the authors of the report wrote something like “he publishes in second-tier journals.” For them, Psychometrika is a second-tier journal, for me it’s not. For me it’s a primary outlet, and one of the better outlets in quantitative psychology as I see it, so I didn’t take offense at that, but it’s a perspective.

Statisticians wouldn’t consider Psychometrika a first-rate journal.

No, they want to see things in Annals of Mathematical Statistics or in JASA.

Given that you’ve worked with statisticians and you work in the statistics department, you probably know quite a few statisticians; how do they look at psychometrics in general?

I don’t know that they have any special view; it may depend on the particular department and who supports what, and the time period involved. For example, at Stanford, Ingram Olkin was a multivariate statistician, but he had huge interests in educational statistical issues.
That was a focus, which in his case led to developments like meta-analysis, which are extremely useful. Would I consider them to be psychometricians exactly? Maybe, maybe not. Are they useful and relevant in psychology, to the same people who are doing substance abuse research? Yes, it is relevant! But it’s a different kind of data. There are lots of research methods, and I wouldn’t say those things are all classical psychometric methods, but they’re relevant. Should the journal Psychometrika publish things like that? Yes, I think it should! Psychometrics should be considered more broadly than just measurement in my opinion, or at least, whatever quantitative psychology is going to be in the next few years.
10 Cronbach, L. J. (1957). Two disciplines of scientific psychology. American Psychologist, 12, 671–684.
So that could involve brain data as well?

Well, I think that psychology has not played the role it could’ve or should’ve in the last 15 years of the development of genetic methods and biochemical measurements, especially in the relation between those measurements and more psychological phenomena. If anybody is studying the interrelations, you would think it’s the psychologists who could do it from this end as well as the biologists looking at it from that end. But you tend to see the biologists and you don’t tend to see a lot of psychologists working on that, and I’m discouraged that psychology hasn’t had more of an emphasis on that. We had an earlier little discussion on the kind of quantitative methods that are relevant to psychology. Well, social psychologists and clinical psychologists are nowadays getting a lot of brain wave data, but psychometrics hasn’t had a lot to say about that. A couple of papers here and there, but not a whole lot. If someone is interested in brain wave data in psychology, would they now study psychometrics? Maybe not, they need to study MATLAB; they need to study various kinds of methodologies that they’ve developed there. Psychometrics as traditionally developed historically deals with the kind of data that was available in the 1900s, the 1950s, and 1980s, but now, all kinds of other data are available, and it’s not clear that we as a profession, the psychometrics profession, are at the forefront of those kinds of data.

Do you think psychometricians are scared of those developments?

I don’t know, how can I say this. In psychology, historically, if you go back to Thurstone or people like that, they were interested in substantive psychological issues and found the methods that would help them deal with those. For them, the problem came first.
I think what’s happening with fMRI and other developments is that people who’ve worked on those methods are trying to develop and deal with the kind of data that they have and are looking to wherever they can find methods that will work. So sometimes cluster analysis has been rediscovered over in bioinformatics, even though it has its own history in psychology and for that matter a whole separate society that your advisor Willem Heiser is involved in, the Classification Society. People with substantive interests often try to find the areas that would help them do things they want to do, and I suspect old guys like me don’t have those kinds of new data sources and so we haven’t become interested in those and so we’re trudging along. It takes a newer generation to say, “this is really what I’m interested in and where can I get the kind of techniques that I need in order to do that optimally.”

Should the Society, or any other institution for that matter, encourage that kind of research?

I think they should, and they do. At UCLA, we have a very good social psychology program, and one of the young people became interested in social neuroscience, or applying neuroscience to social psychological problems. If he would’ve needed traditional psychometrics, he probably would’ve talked to us in the department that do
that kind of stuff, but he didn’t need those methods for the problems and data he is dealing with. Having said all that, many of these other fields are discovering the same kind of issues that we’ve dealt with, which are, for example, errors of measurement, bias in parameter estimates. There are things that are universal across quantitative methods no matter what the application. I have no prescription for what psychometrics as a discipline should do; I think it will depend on who’s doing what and it will evolve naturally. Different people have different particular histories, I have a particular history, I got involved in certain things and looked at methods related to those, but other people have other interests and will find other methods, or will borrow or reinvent methods.

So, what will you be remembered for?

I hope I’ll be remembered by my students!

What work do you consider to be an important contribution?

Well, I’m pretty happy having thought about a different way to think about structural equation modeling than the one that existed at the time, which was the LISREL approach. I think the so-called Bentler-Weeks model11 is a contribution that will stay around, because it’s a nice simple way of thinking about things. But I did it because I enjoyed trying to figure out how to think about it and hopefully it will be useful. It will stick around if people find it useful, and if people don’t find it useful, it will get dumped and that will be that.

Do you think it has found its way to applied researchers?

I would say to some extent. Because these things are so heavily multivariate, applied researchers require computer programs, so I think it all depends on which of the present computer programs will be used on a daily basis in the future. I don’t know that the Bentler-Weeks model is going to be around in that sense; it depends on what people put in their computer programs.

But you wrote a computer program, right?
I wrote a computer program, the EQS program,12 which may be around for another couple of years, but then it will probably die out. I developed EQS for teaching purposes, not as a commercial thing, so when I’m gone, EQS will probably be gone.
11 Bentler, P. M. & Weeks, D. G. (1980). Linear structural equations with latent variables. Psychometrika, 45, 289–308.
12 Bentler, P. M. (1985). Theory and Implementation of EQS, a structural equations program. Los Angeles, CA: BMDP Statistical Software.
Is there work you’ve done that didn’t reach the audience you hoped it would?

No, I don’t think so; whatever I did in general has been reasonably well received. That doesn’t mean that there weren’t huge battles sometimes getting them into print!

What was your biggest battle?

I have no idea which one was the biggest, but my most recent battle I mentioned earlier; that was my publication from last year on internal consistency coefficients and issues associated with them as measures of reliability.13 There was a reviewer who felt that this paper absolutely should not see the light of day because it would mislead people so terribly that it would be a horrible mistake for it to be published.

Those are some strong words.

I’m appreciative that the editor didn’t totally agree with that person; it got published. Now, it may be that that guy was right and that’s exactly what will happen, but I’d like to have expressed what I did, and if people think it’s trash, let it be trash, that’s fine! I’d rather have people saying that it’s trash than saying “don’t let anybody think about this”; that’s the part I don’t like. There are people out there who are very determined that their way of thinking about things is the only way, and I’m pretty open to letting all different kinds of ways of thinking be out there and then the ones that will be useful will hopefully take precedence.

You already mentioned some people that have advised you, that you worked with, especially in your early career; who else was a great inspiration to you?

Inspiration to me? The people who encouraged me are maybe different from the people who were inspirations to me. For example, Henry Kaiser was one; he was not at any place where I studied but I started corresponding with him as a grad student, and he was very kind in responding. He was encouraging, and that meant a lot to me. Not that I had answers but he thought the kind of striving I was doing could be directed in a certain way, and that could be useful.
Anyway, it was very encouraging so I liked that. Intellectually, I loved Karl Jöreskog for taking things that had been around for a long time and making very clever wrinkles on them. He wasn’t the first but the most successful person who initiated Confirmatory Factor Analysis. Then there was the idea of “why do all parameters have to be free, why not fix some,” a very simple but very inspiring idea, which turned out to be very useful. And, I have to say again, all of my students and collaborators have helped me beyond anything.
13 Bentler, P. M. (2017). Specificity-enhanced reliability coefficients. Psychological Methods, 22, 527–540.
And, historically speaking, what do you consider the most important psychometric work?

Psychometrics as opposed to quantitative, that’s tricky, but I’ll stick with psychometrics, because I do identify with it in part: Spearman and Thurstone probably.

And in quantitative psychology at large?

I don’t know, it’s hard to say; the field is so huge! Very influential was the idea of errors in measurement, which of course had been around for a long time in astronomy (it’s not like Spearman invented it), but Spearman thought about it in a way that made it relevant to psychological measurement. And I thought that was really fabulous. In another measurement way, virtually only a few years after that, Sewall Wright developed his path analysis.14 He’s not a psychometrician though. Karl Jöreskog, whom I’ve mentioned before, is not a psychometrician; he was a statistician. He has a publication or two or three that I would call psychometric publications, but most of them aren’t psychometric publications. It’s the unification of the fields that has been very important and inspiring. Anyway, Sewall Wright’s contribution for showing how to think about equations in a visual way, in a very convenient and simple way, was a brilliant idea that came out of nowhere.

And what do you consider psychometrics’ biggest achievement? What has it done for the world?

In the early days, intelligence testing, more recently I would say computerized adaptive testing, when we talk of psychometrics in the narrow sense.

Do psychometrics and quantitative psychology blend? There is the intuition that psychometrics and quantitative psychology aren’t the same, but is that an old-fashioned distinction?

I don’t know; personally I think of psychometrics more as measurement and quantification of individual attributes, or the theory thereof, and quantitative psychology involves mathematical models.
But there’s a whole field of mathematics, and because we’re always sampling, there’s a lot of statistics involved, and nowadays, because of sheer quantity, you cannot exclude computer science: they’re all relevant. Presumably, as more people see the need for pulling in things from one field or another, either this way or that way, they’ll do it as they see the need.

What do you think psychometrics will be like in 50 years?

I have no idea.
14 Wright, S. (1934). The method of path coefficients. The Annals of Mathematical Statistics, 5, 161–215.
No speculations?

I can’t project, I’m sorry.

What is your biggest hope for psychometrics? What do you think is a problem they still have to solve?

A lot of methods that are relevant to psychometrics, including factor analysis and structural equation modeling, have, statistically speaking, been based on the random sampling of people from a population and drawing inferences to the population. In that setup, the number of variables is usually small relative to the number of observations. But the new world of data goes two ways. There’s a “Googlization” where you could still deal with a small number of variables and take the number of subjects to infinity so there’s almost no need for statistical inference, because you’ve got samples of hundreds of thousands. But you also have the case, as in a lot of genetic data, where the number of actual observations is relatively small but the number of variables is extremely large. And classical statistical inference isn’t made for that, and we have to solve how to do a factor analysis in a situation where N is tiny and the p is huge. What does it mean to do a factor analysis in that situation? There are people working on these things, but that involves issues that haven’t been totally resolved. Some of those issues have historically popped up in psychometric writings with regard to bouncing betas in regression or, just to stick with the regression example, with regard to biases in R-squared coming from undersized samples. It’s a generic issue, and one that needs a lot more attention, because a lot more data are coming in that way. And as I said, with the number of variables getting so large, even if you have a huge dataset, how do you do a factor analysis on 20,000 variables and half a million subjects? We probably don’t want to do the singular value decomposition, we probably don’t want to calculate the covariance, but what are alternative ways to deal with that in an effective way?
You’re probably going to have to pull in a lot of computer science knowledge as to how to pull in that data efficiently so your computer doesn’t get clogged with all these numbers. Of course, the computers are getting smaller and can handle more, but the point is we’re in need of newer methods to deal with these issues than those that have traditionally been used in this field.

Do you think that psychometrics really has to keep an eye on computer science?

We have to keep an eye on computer science; we have to keep an eye on statistics. But then it depends on your substantive interest: if you’re really interested in individual differences, and the biological basis of individual differences, you’ll have to learn a lot of biology, you’ll have to learn a lot of methods that integrate these two fields. For someone starting out, there are lots of challenges available. There are still plenty of opportunities.
You’ve been working here for a very long time and you’re still doing a lot of work; what are your personal plans for the future?

Well, I still don’t feel satisfied with the kind of statistical inference from non-normal data that we have. I’ve had some success in dealing with it, for example, with Albert Satorra, on a variety of test statistics,15 but I still feel that we could do better than what we currently have. I would like to get back to my old problem, about a version of the Guttman scale, which, as I’ve mentioned, I’ve called an absolute simplex.16 One version of the Guttman scale only has one parameter. Another one parameter model is the Rasch model: one parameter per item. The question then is: how could one compare a Rasch model to a model like this, not in an IRT framework, where the method of evaluating this model and the method for evaluating that model are completely different. If they’re different methods, how can you compare them? How can you decide if one is better than the other? That’s still bothering me, I don’t feel I have a good enough answer to when you have things that are not commensurate but somehow you want to compare them anyway. I don’t have an answer yet, and that’s leaving a challenge for me to still work through and write up and finally find what I consider a reasonable approach.

That’s lying around on your desk.

It’s lying around and I think about it every once in a while, then I put it away again!

And you’re also still advising students, right?

I have not taken graduate students for a few years, but I just got an e-mail from a student in statistics, who has become interested in latent variable modeling and might want to have somebody to work with. So I can work with him, but only under the condition that they know I’m probably not going to be around to finish their Ph.D. thesis. I may be, I may not be.
I’m in reasonable health, thank goodness, and I’m planning to stick around, but because I can’t guarantee I will be healthy for another 5 years or whatever, I’m not taking new students.

Are you still eager to keep working as long as you can?

Absolutely. And I hope I have the good sense to stop when I don’t make sense anymore.
15 Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis. In A. von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419). Thousand Oaks, CA, US: Sage Publications Inc.
16 Bentler, P. M. (2013). Reinventing “Guttman Scaling” as a statistical model: Absolute simplex theory. Presentation, Leiden University.
Chapter 4
Larry Hubert
“I don’t necessarily work in psychometrics proper, I may be a tough subject for you here.” Lawrence, or Larry, Hubert, emeritus professor of psychology, statistics, and educational psychology at the University of Illinois, was president of the Psychometric Society in 1983. Hubert earned his Ph.D. under the supervision of Patrick Suppes in 1971 at Stanford University. His research interests are in cluster analysis, combinatorial data analysis, ethics and statistics, and probabilistic reasoning.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_4
Thank you for being my interviewee in this oral history project on the history of psychometrics, Larry. In this interview I’ll be asking questions about three themes. First of all, your career as a psychometrician, the relation between psychology and psychometrics, and the history and future of psychometrics. And let’s just start with the question how you ended up in psychometrics?

Okay, but let me begin by noting that I don’t refer to myself as a psychometrician.

That’s an interesting start.

I don’t necessarily work in psychometrics proper; I may be a tough subject for you here.

That’s fine!

If I had to characterize who I am, I would say that I’m a quantitative methodologist. I’m working in psychology but I’m also working more generally in the social and behavioral sciences. I’ll talk a little bit more later about how I would view quantitative psychology in general as opposed to psychometrics in particular.

That would be interesting. So how did you end up where you are?

I guess the short answer to the question of how I ended up where I am is that it was unplanned and accidental. And this might not be unusual for a lot of people; they end up in careers that were unexpected when they started, but because of various life circumstances, they went certain ways. The impetus for where I ended up goes back a long way, to an occurrence that happened a lot longer ago than you were born. This goes back to the 1950s. In 1957, I was just beginning the eighth grade; this was an era when I was worried about eating pizza and drinking beer and things like that. The Russians did one thing at that point; in October of 1957, they launched Sputnik, the satellite. Up to then I had planned on entering the next year into one of the trade programs in high school; in particular I had my eyes set on being a cabinetmaker.
But after Sputnik, there was an incredible national mobilization in the United States for science education, and before the end of 1958, probably every junior high school student was tested. It was a massive program. Now this was my first encounter with psychometrics; my guidance counselor politely told me after I had gone through the testing that my plans on being a cabinetmaker were over. Instead, I found myself doing three things in the next several years. First, I attended a week-long workshop sponsored by NSF, NSF standing for the National Science Foundation. This was a program in science and mathematics at St. Olaf College. This was right after the end of my ninth grade in the summer of 1959. There was a large group of us from Minnesota, and some from the Midwest area in general, who stayed there for a week, attending chemistry, biology, maths, and physics lectures of various kinds; the NSF hoped to stimulate an interest in these areas, so the United States could eventually best the Russians. So that was one thing that I did.
After my tenth grade, I went back to St. Olaf, again because of the National Science Foundation, and did freshman biology. This was my second encounter with psychometrics; when I went to do the subject matter exam for college and placement in biology, I had a perfect score. In fact, I remember that I did not even need the alternatives. They asked a question, I knew the answer; I didn’t need to eliminate the distractors. That was the second encounter with psychometrics. In the 10th to the 12th grade, at my high school in Duluth, they had instituted a new math program and this was called the School Mathematics Study Group. There was an acronym, SMSG; people later on said that it stood for “some math, some garbage,” but it was a very intensive program that was developed at Yale, by a person with the name of Ed Begle. I spent my senior year studying calculus, so I was pretty well prepared when I went to college. I ended up graduating Valedictorian of my class, and because of that I received a local Duluth scholarship. It was given by somebody who was a benefactor of a scholarship fund for Duluth students, and that gave me completely free entrance to Carleton College, which was also in Northfield, just like St. Olaf. I graduated in 1966, with a degree in mathematics along with a teaching credential; I had plans to teach high school mathematics, at least that’s what I thought I was going to do. My Carleton department at that time had nominated me for a National Science Foundation Fellowship at Harvard, which I took and I went there in 1966/1967. I received a Master of Arts in the Teaching of Mathematics, but unfortunately, I couldn’t find a job teaching high school mathematics in the Northeast. So I needed a job. In the year before I went to Harvard, I had applied for and received a US Office of Education Research Traineeship at Stanford, so needing a job I asked Stanford to reactivate the offer. They did and off I went.
The program in education was called “Mathematical Studies in the Education Processes” and Pat Suppes was my major professor. I wrote a thesis on modeling perceptual processes in geometric form perception. You interviewed somebody this morning who was also a Suppes student, Paul Holland. He was not in the program that I was in; he worked with Pat on semi-orders, if I remember correctly. Two other people that were part of the program that I was in were Ingram Olkin, who just died very recently, and Lee Cronbach; if you are in psychometrics, you probably have at least heard Lee’s name. One of the things that I did with Cronbach, and this was in the late 1960s, was in a class that Lee taught. Everyone in the class got the galley proofs for Lord and Novick; we went over them in detail and corrected them for Lord and Novick! This was quite an immersive experience in psychometrics I must say. When I was finishing my doctoral degree at Stanford, I was looking for a position in mathematical modeling. Unfortunately, there were none that were available, so this is getting to be an old story. Originally, I wanted to teach mathematics at high school, but there were no jobs available; now I wanted to be a mathematical modeler, but again there were no jobs available. Along the way I had done a lot of statistics, so Cronbach steered me to quantitative jobs in educational psychology, and I got one in 1970 at the University of Wisconsin in Madison. When I went there, my two main colleagues were Frank Baker, who was a well-known psychometrician, and Anne Cleary. She met a very early death at the University of Iowa from a disgruntled graduate student. There was another colleague there, his name was Chet Harris.
Chet had been an editor of Psychometrika and president of the Psychometric Society, so he was a very well-known individual. He had left for Santa Barbara the year I came; he wanted to go out and be in the California sun, so he left in 1970. Now, it turns out that I actually followed Chet out to Santa Barbara in 1977. That’s about it. I went from Santa Barbara to Illinois psychology in 1987, and I stayed at Illinois till I retired last year. I’ve worked throughout this time on quantitative topics in psychology and in educational psychology, but not in psychometrics per se. So, there is very little test theory you’ll find anywhere. At one time, I did teach a sequence out of James Algina’s and Linda Crocker’s book,1 but that’s the extent of my exposure to teaching pure psychometrics. Do you see test theory as proper psychometrics? Yes. Lord and Novick is psychometrics. Classical test theory is psychometrics. Concepts of reliability, validity, splitting up the kinds of validity you have, that’s psychometrics. As you mentioned, I interviewed Paul Holland, whose advisor was also Suppes, but you had a very different topic than he did. Can you explain something about your dissertation, even though it’s not psychometrics proper? My thesis as I said was in geometric form perception. So I set up a model, I estimated parameters, and did the fit using data collected tachistoscopically from fourth graders. I formulated the model using the notion of figural goodness, in a certain way. So it was a very traditional mathematical modeling topic. My thesis was published as an article,2 in the Journal of Mathematical Psychology, not in Psychometrika. It would not have been appropriate for Psychometrika really. Because it had nothing to do with test theory? If you think about the journals and the way that they’re split up in our area: some are devoted to what I would call test theory proper.
For example, the Journal of Educational Measurement would be what I would consider a test theory journal per se. However, Psychometrika is not necessarily just a test theory journal. But that is the flagship journal of psychometrics, right? Let me go back and make a distinction. I’m going to break up the field of quantitative psychology into three parts. Just like Gaul was broken up, remember, you’re a literature major, right? Okay, so if I looked at a Psychology Department that had a quantitative division, I would expect to see, if this is a well-stocked quantitative department, three parts: one would be mathematical modeling. If you want to think of a Psychometric Society president, the epitome of that is Duncan Luce; he was a craft theorist and the quintessential mathematical modeler. Could you give a definition of mathematical psychology? Mathematical psychology? I can make a distinction between quantitative psychology and mathematical psychology. Mathematical psychology is more the modeling variety; their flagship journal is the Journal of Mathematical Psychology. There’s very little psychometric-related material in the Journal of Mathematical Psychology. So, let me go on to my three parts. Mathematical modeling: Duncan Luce. Secondly, there’s behavioral statistics. A Psychometric Society President in behavioral statistics would be somebody like Quinn McNemar. You probably know “McNemar’s test of correlated proportions.” He was a Stanford faculty member for many years in education, but joint with statistics as well. Quinn McNemar was a student of Louis Terman. Terman would be much more on the psychometric side; he worked on the Stanford Binet3 and other tests. The third part, as I would say, would be test theory, and a Psychometric Society president who would epitomize that would be Fred Lord, from Lord and Novick. Mel Novick would also be a test theory person, except that Novick actually dabbled a lot in Bayesian statistics. One of the things you claim is that the Psychometric Society is solely aimed at psychometrics. Well, that’s a statement you can agree or disagree with. I disagree. Let me give you the reason, or an argument. There’s one phrase that was the impetus or the starting point both for the Psychometric Society and Psychometrika, and this statement was on the cover of Psychometrika till 1984, I quote: “A journal devoted to the development of psychology as a quantitative, rational science.” There is no mention of psychometrics whatsoever.
1 Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando: Holt, Rinehart and Winston.
2 Hubert, L. (1972). A statistical method for investigating the perceptual confusions among geometric configurations. Journal of Mathematical Psychology, 9, 389–403.
And in fact, if you go to the very early Psychometrikas, probably up until the sixties, you see much more in the way of behavioral statistics and not much modeling, but a lot of factor analytic material, principal components analysis, multidimensional scaling, individual differences scaling. It is since the publication of Lord and Novick, which finally introduced formal modeling to the field, that you saw a movement in Psychometrika, a preponderance of material that could be considered to be more traditional psychometrics. So you have all the structural equation modeling, as a dominant focus of Psychometrika, but when Psychometrika started out, this was definitely not the case. Thurstone’s idea of Psychometrika and the Psychometric Society was how we should do all psychology mathematically, quantitatively. So, I go on from there to say that you could actually subdivide all the presidents of the Society into our three groups. Most have long since died so you’ll only get to speak to the living ones of course.
3 Terman, L. M. (1916). The measurement of intelligence: An explanation of and a complete guide for the use of the Stanford revision and extension of the Binet-Simon intelligence scale. Boston: Houghton Mifflin.
Yes, I’m too late for many of them. If you had a list, I could probably categorize every president as either a modeler, a behavioral statistician, or a test theory person. If you go back to the first president, Thurstone, he did some work in classical test theory, primarily on mental abilities, but it was all done through the guise of developing the factor analytic model, and that’s maybe more closely affiliated with multivariate analysis than it is with psychometrics. At the time people kind of conjoined interest in principal components with factor analysis, but principal component analysis is strictly behavioral statistics. Behavioral statisticians rarely develop a model; they’re interested in very earthy least-squares objective functions, so most people would never say principal components and psychometrics are one and the same. Early Psychometric Society presidents, for example, Guilford, did some psychometrics in his subdivisions of the intelligence area,4 but he was also a statistician and wrote books on educational statistics.5 I study a person by the name of Truman Lee Kelley. He wrote some books on classical test theory,6 but he also has a major text in statistics,7 so he would not be a modeler; he would be a behavioral statistician with some dabbling in classical test theory issues. So if you go back, you find there are not many modelers who have been president of the Psychometric Society, except for Duncan Luce. What I’ve always heard is that there was a very clear moment when psychometrics and mathematical psychology went their separate ways and that there were no longer mathematical psychology papers in psychometrics since. Right. My advisor, Patrick Suppes, was the major instigator in that. Pat Suppes was part of that movement as well as Duncan Luce and David Krantz. They moved away because Psychometrika was not allowing publications in that area. If you look at the first several volumes of Psychometrika, you’ll find some articles on modeling.
For example, Gordon Bower’s one-element model8 is in there and a couple of other articles. Later on, Psychometrika refused to, or didn’t allow, articles on modeling. The mathematical psychology people got very upset with Psychometrika and said, “Adios suckers” and left.
4 Guilford, J. P. (1968). Intelligence has three facets. Science, 160, 615–620.
5 Guilford, J. P. (1950). Fundamental statistics in psychology and education. New York: McGraw-Hill.
6 Kelley, T. L. (1927). Interpretation of educational measurements. New York: World Book.
7 Kelley, T. L. (1947). Fundamentals of statistics. Cambridge: Harvard University Press.
8 Bower, G. H. (1960). Properties of the one-element model as applied to paired associate learning. Psychometrika, 26, 255–280.
But what was the first reason for this anger? There was a paper by Luce and Tukey, the first paper in the Journal of Mathematical Psychology,9 which caused the uproar. Psychometrika rejected the paper, and then they decided to go somewhere else? Or did they decide to go somewhere else because they weren’t so fond of Psychometrika? Psychometrika refused to publish some of these papers on modeling. And the people that were submitting these papers were very influential: Suppes, Luce, Tukey. And they decided to form a new journal, the Journal of Mathematical Psychology through Academic Press. I think the first issue of the Journal of Mathematical Psychology was in 1964; you can go and take a look at those particular pieces that appear in volume 1, see the papers that were rejected by Psychometrika. They figure very prominently as lead articles in the first issue. After that point, there was an enormous amount of ill will. I can imagine. There was a lot of ill will between those groups. I guess it’s still relevant. The one journal that has been much better at combining some of the modeling work and some of test theory and behavioral statistics has been the British Journal of Mathematical and Statistical Psychology. And in fact, even though I’ve published a lot in Psychometrika, I’ve published more in the British Journal. This is hard as a former editor of Psychometrika to say, but I’ve found the British Journal to have a much better review policy. Some of the reviews that you get when you submit to Psychometrika are just stupid and brutal, frankly. I always had much better feelings about the British Journal than I did about submitting to Psychometrika. Actually, I edited another journal, the Journal of Educational Statistics, before I was editor for Psychometrika, so I was always more into educational and behavioral statistics than I was into psychometrics. Your advisor was more of a modeler than a psychometrician. Do you recognize that in yourself? 
Or have you gone in a completely different direction? I did one modeling paper and that was in the Journal of Mathematical Psychology in 1972; that was my thesis. I’ve never done anything else, that was it. We’re done. You’re not a mathematical psychology person. No. One question you have is in what areas have you worked? You ask for three, I’ll give you four.
9 Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1–27.
Great! When I went to Wisconsin, I worked on cluster analysis, and consistently for many years. I made a gradual transition to a very general area, which I gave a name to: combinatorial data analysis. Combinatorial data analysis involves things like multidimensional scaling, graph theory, and many other computational things. More recently, I’ve moved into ethics as it applies to statistics. In fact, I wrote a book with another psychometrician, not a president, but a career award winner of the Psychometric Society, Howard Wainer, which is called A Statistical Guide for the Ethically Perplexed.10 It takes off on Maimonides, and there is a little discussion in there as to why Maimonides is relevant to all of this. So that’s three. What I’m working on right now is what I would call applied probabilistic reasoning. It’s about how to reason probabilistically by looking at data not necessarily in the form of numbers, but in the form of information. Being Dutch, you don’t know about the O.J. Simpson trial, do you? I do! I have some material that I’ve started developing that begins with the O. J. Simpson trial. Simpson’s defense attorney, Johnnie Cochran, had a famous quip, probably the most famous one in all of jurisprudence. There was an issue of O. J. Simpson not being able to get on the supposed glove; it wouldn’t fit, and this glove was found at the murder scene with all sorts of blood on it, and the prosecution stupidly had O.J. Simpson try on the glove. So if you Google “O. J. Simpson trial fitting of glove,” you will get this incredible theatre, and in this closing argument Johnnie Cochran made the statement: “If it doesn’t fit, you must acquit.” An if-then statement. I take that particular statement and do a little take off on it about reasoning. So you have the evidence of the glove not fitting, what does that say or not say about the event of an acquittal? I also have one in terms of Cinderella; you know the story of Cinderella, right? I do know the story. I watched a lot of Disney movies when I was a kid.
10 Hubert, L. J., & Wainer, H. (2012). A statistical guide for the ethically perplexed. CRC Press.
This Cinderella story is a little risqué. When midnight comes along, she runs out, drops one shoe, keeps the other one. However, not only does she drop a shoe, she drops all her clothes, the tiara comes off, the bra, the panties, everything. So, when the prince comes to decide who Cinderella is, besides the slipper, he has all of these clothes. You have a sequence of probabilities: every clothing item, if it fits, is facilitative of this being Cinderella. So, the event that you’re interested in is whether this person is Cinderella. If the shoe fits, the probability of this being Cinderella increases, if the tiara fits, the probability increases, if the bra fits, the panties fit, the dress fits, until you might get to the point, in a legal sense, when it being Cinderella is “beyond a reasonable doubt.” You can probably never get to “beyond a shadow of a doubt,” which is essentially probability one, unless Cinderella has happened to keep that second glass slipper that was exactly the same in size and then came out with it. We call that a smoking gun; it’s something so definitive, all doubt goes away. I started this research line with these kinds of stories. You use the stories in a research context, or are they examples of what you’re researching? It’s a way of motivating formal concepts in terms of conditional probabilities. For example, we all get screened at various times for things: after a certain age, males get screened for prostate cancer, women get screened for breast cancer, and they want us to do these screenings on a continuing basis. So, the test indicates whether you have cancer or not, and the state of nature is whether you actually have cancer or not. In diagnostic tests you have a number of concepts: specificity, sensitivity, the positive predictive value, the negative predictive value. The question is: if I test positive for prostate cancer, what is the probability that I actually have prostate cancer? And this is the kind of probabilistic reasoning that follows from O.J. Simpson and Cinderella; those are the things that I’m interested in. Would you characterize this as psychological research? It’s really more behavioral sciences. And it doesn’t have to be behavioral; it can be historical as well. But it’s the idea of how the information about the occurrence of something affects your judgment about the occurrence of something else. There are many examples of this. There is the whole back and forth about killing Osama Bin Laden and the assessment whether or not he was at the compound when they went in with SEAL Team Six. Those are the kinds of examples where applied probabilistic reasoning comes in terms of decision making: whether or not you need a biopsy, whether or not you should send in SEAL Team Six, whether you should wear a raincoat on a particular day, and so on. That’s quite different from traditional psychometrics. It’s very different!
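The screening question raised in this exchange (if I test positive, what is the probability that I actually have the disease?) and the Cinderella story (each item that fits raises the probability) can both be sketched with Bayes' rule. The following is an editorial illustration only, not material from the interview: the function names and all numbers (90% sensitivity, 90% specificity, 1% prevalence) are hypothetical, and the sequential update assumes the pieces of evidence are conditionally independent, which real evidence often is not.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) by Bayes' rule.

    sensitivity = P(positive | disease)
    specificity = P(negative | no disease)
    prevalence  = P(disease) before testing
    """
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


def update_probability(prior, likelihood_ratios):
    """Update a prior probability with a sequence of evidence items.

    Each likelihood ratio is P(evidence | event) / P(evidence | not event).
    Assumption (hypothetical, for illustration): the evidence items are
    conditionally independent, so the ratios simply multiply on the odds scale.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical screening test: accurate, but the condition is rare.
    ppv = positive_predictive_value(0.9, 0.9, 0.01)
    print(f"P(disease | positive) = {ppv:.3f}")

    # Hypothetical Cinderella-style sequence: each fitting item is evidence.
    for n_items in range(1, 5):
        p = update_probability(0.01, [9.0] * n_items)
        print(f"after {n_items} fitting item(s): {p:.3f}")
```

With these made-up numbers a positive result from a fairly accurate test still leaves the probability of disease at about 8%, because the condition is rare; each further independent piece of evidence pushes the probability up, but it reaches one only with evidence that is impossible under the alternative, the "smoking gun" of the story.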
If you look at the history of psychometrics, or perhaps the history of behavioral statistics, is there one work you were very much inspired by? For my own research? It depends on the area. I guess my first interest in cluster analysis was stimulated by some work that Stephen Johnson published in 1967 in Psychometrika on hierarchical clustering.11 It was an article that first introduced the idea of an ultrametric as a characterization of a hierarchical clustering scheme; it really codified my interest in that area, and a lot of what I did in the 1970s was a direct result of that particular piece. That piece was heavily influenced by two Psychometric Society presidents who were also at Bell Labs at the time that Johnson wrote it at Bell Labs: one was Douglas Carroll and the other was Roger Shepard. Roger Shepard is still alive; you might be able to go to Arizona and meet him! The Johnson paper was really a stimulation for that whole area of combinatorial data analysis and my interest in optimization. A few years after I went to Wisconsin in 1970, I got what was called a Social Science Research Council Fellowship, which allowed me to take a full year off teaching. I just took classes at Wisconsin, including George Box’s class on time series, Norman Draper’s class on regression, and all the optimization courses available in engineering. My work in combinatorial data analysis was really defined by all of that work. In fact, there are two SIAM monographs written by me and two other Psychometric Society presidents, Jacqueline Meulman and Phipps Arabie. The 2001 monograph12 was about social behavioral science applications of dynamic programming and is entirely about combinatorial data analysis. The 2005 monograph13 was on proximity matrix representation. A lot of that work was stimulated by my interest in optimization. A request from Abigail Panter and Sonya Sterba stimulated my interest in the interface between ethics and statistics. They were developing a volume called Handbook of Ethics in Quantitative Methodology14; you should probably take a look at that since you are interested in history. They asked me to submit a chapter that dealt with ethics in statistics and I enlisted Howard Wainer to help me on this chapter. I went through notes for what I had done in my teaching over 40-some years, and I had put in a lot of examples where statistics was used inappropriately or unethically. This involved some of the probabilistic reasoning material as well as a number of legal and medical contexts. I put together what I thought was a very nice chapter, albeit a very long one. Once that chapter got published, I still had more material, so that got turned into that book.
11 Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241–254.
As I said, I’m working on the applied probabilistic reasoning for now, and that interfaces directly with the newer material on regression trees, optimization, and cross-validation ideas. In fact, the thesis of my very last student was on the prediction of violence and dangerous behavior, and it involved classification and regression trees. When he graduated, he got a job not in an academic area but in data analytics for the Los Angeles Dodgers. There is not much of that published in Psychometrika! What will you be remembered for? I’ve written a lot of papers as you can see from my vita, but I’ll probably be most remembered for the two Hubert, Arabie and Meulman SIAM monographs, and this last book that I did with Howard Wainer. And who knows what is more to come; perhaps things might arise out of this applied probabilistic reasoning direction.
12 Hubert, L. J., Arabie, P., & Meulman, J. J. (2001). Combinatorial Data Analysis: Optimization by Dynamic Programming. Philadelphia: SIAM.
13 Hubert, L. J., Arabie, P., & Meulman, J. J. (2006). The structural representation of proximity matrices with MATLAB. Philadelphia: ASA-SIAM.
14 Panter, A. T., & Sterba, S. K. (Eds.). (2011). Handbook of ethics in quantitative methodology. New York: Routledge.
There are a few published articles on this now, with my student who is the Los Angeles Dodger employee. If you think of psychometrics as test theory, do you think psychometrics needs to become a broader discipline? Well, I would turn it around. I would say that in the beginning psychometrics was very broad and it included behavioral statistics. Once item response theory took hold, psychometrics got very narrow in my view, and if you look at what Psychometrika publishes now, it really is just an IRT journal. You don’t see much work on multivariate analysis any longer, maybe some material on structural equation modeling, but that ties directly back to tests as opposed to things more general. I would hope we would gravitate away from psychometrics as a label and more toward “quantitative psychology,” or “quantitative behavioral sciences,” and include the test theory part as only one of three. The math modeling part is probably very minor, behavioral statistics is probably the most relevant, but you have other offshoots like judgment and decision-making. They have a separate society, which is not only tied to psychology but with business and all sorts of other areas such as political science. So you’d like psychometrics… To become a more inclusive group. But you still attend the yearly meeting. What do you get out of these IMPS meetings? There are a few papers that I would be interested in hearing, but they’re becoming fewer and fewer compared to when I started. My first Psychometric Society meetings were in the sixties. Since then, I haven’t worked in areas that seem to be now the dominant topics in Psychometrika. But even though I haven’t worked on topics such as computerized adaptive testing, I’ve sat on many dissertations involving CAT and related psychometric topics. One of my colleagues at Illinois is Hua-Hua Chang. I was very instrumental in having him hired from the University of Texas at Austin, and I have engaged in writing his promotion papers, and so forth. 
As part of the quantitative area, I’ve been very actively involved with people writing their dissertations in psychometrics proper, so I know the material. It’s just that I choose not to do it myself. But you were a president anyway. Yes, but my presidential address was on a very statistical topic; it wasn’t psychometrics at all but dealt with linear assignment models.
When you were president, did you make it your task to make it a more inclusive group? Once you get elected president, you are blessed with organizing the annual meeting. So when I was elected (this was in the very early eighties) we had the meeting in Santa Barbara with the banquet on the beach; it was actually rather nice! But at the time, there was another society, the Classification Society, that I had also been president of, because of my interest in clustering and scaling. And that meant this was a joint meeting. We’ve had a fair number of joint meetings over the years between the Classification Society and the Psychometric Society. Phipps Arabie organized a joint meeting; Doug Carroll was active in both, as well as Jim Ramsay and Joe Kruskal. So, to say that I made it more inclusive, I guess I tried in this sense, but you’re only president for 1 year, and you spend most of the time organizing the meeting. Who do you think is the most influential psychometrician? First of all, the most important book historically, would be Thurstone’s The Vectors of Mind.15 For a psychometrician proper, I probably would say the Lord and Novick16 text. What did Paul Holland say? I don’t think I asked that question specifically, but Lord and Novick was mentioned a couple of times, so I can imagine he would also pick that book. I was a very good friend with Mel Novick before he died. He would have given you a very nice interview! I’m sure he would’ve! He may have mentioned Thurstone as well. I think the most important psychometrician again is probably Thurstone, for a couple of reasons: not only for his writing, but for the enormous number of students he produced. There were Tucker, Horst, and Richardson, and the list goes on and on. Even though you’re not a psychometrician per se, what do you think is psychometrics’ biggest achievement? What has it brought to the world? Or maybe hasn’t brought to the world?
15 Thurstone, L. L. (1934). The vectors of mind. Psychological Review, 41, 1–32.
16 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental testing. Reading, MA: Addison-Wesley.
Well, I don’t think I want to answer that; I’m not sure if all in all the idea of measuring intelligence hasn’t brought more ill than it has brought good. The whole politics of race and psychometrics is not a very happy one. I thought that one time it might be of interest to talk at a meeting about some of the history of the past presidents of
the Society; it might be referred to as the “bad actors of the Psychometric Society.” The Society has been littered from its beginning with some very bad characters. They are dead now so you can’t interview them! Some of these people have done more bad than good, and this is where we could name names, if you like. Even the “proper” psychometricians would probably acknowledge that not everything always happened with the right intentions. What about the future of psychometrics? What is the biggest challenge for psychometrics? Psychometrics has a future in the sense that the testing industry, in all its forms, is an ongoing concern. And this doesn’t only entail intelligence testing; it also entails employment selection, personality assessment, and the like. Some of my best friends do vocational interest assessment. Across the whole gamut, psychometrics has a future in the sense of being an enterprise that’s engaged in a lot. There are certain things that I see that could be considered part of psychometrics, and part of what I am involved with now is our ability to predict human behavior. And in the vernacular, we suck at prediction of human behavior, particularly things like future dangerousness and violence. We’re using these very sloppy instruments to try to incarcerate people. I don’t know if you know how bad our level of incarceration and intolerance is in the United States. Social scientists in some sense can get a lot of this laid at their feet. So, it depends. If it veers off to a belief that we can control and predict human behavior well, I don’t think psychometrics has a future in that direction at all, and unfortunately that seems to be where the money is right now. And things like measuring intelligence are so fraught with political ethnic racial undertones, I don’t think that’s really of value. You can use it to justify major social inequities. If somebody has a so-called low IQ, they don’t deserve schooling. 
The education of the immigrant communities is very different in the United States compared to Holland, which I know something about. The Dutch do a pretty good job, at least you put money into it. The United States don’t. The United States try to find ways not to put money into these things. What are your own plans for the future? I’ve retired! I know! But that doesn’t necessarily mean people quit working, so that’s why I’m still asking. I’m working on applied probabilistic reasoning for now. Also, Terry Ackerman asked me and Willem Heiser to organize sessions on the history of psychometrics. We’ve done those, and maybe Willem and I will do some more history things. That’s one of my interests and one of Willem’s interests.
Keep me posted on those. Yes. Is there something you’d like to add to this interview? Something I forgot to ask? I know what you didn’t ask: you didn’t ask about one’s strongest critics! One’s strongest critics are always those damn paper reviewers that you get. And you had another question; what’s number 8 under number 1? Have you ever had considerable doubt in the direction you took? Here’s a good quote for you: “Possibly in error, but never in doubt.”
Chapter 5 Jan de Leeuw
“After a year, I was not interested in psychology anymore. I didn’t like the idea that everything was, well, let’s say debatable, or uncertain, or up in the air.” Jan de Leeuw, emeritus professor of statistics at UCLA, was president of the Psychometric Society in 1987. Jan de Leeuw finished his Ph.D. under the supervision of John van de Geer in 1969 at Leiden University. His expertise lies in applied statistics, data analysis, and multidimensional scaling.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_5
How did you end up in psychometrics? I started in psychology in Leiden in 1963, and I got fed up with it in 1964, pretty quickly, so I switched to physics for a year, but I didn’t feel like doing experiments. Then I became interested in statistics and mathematics, and I was hired by Len de Klerk in experimental psychology. That was in 1965 or 1966, and he was working with John van de Geer. John van de Geer was setting up the department of data theory at Leiden University. He was dean of the social sciences at the time. I was the first one he hired into this new department of data theory, first as a student assistant and later as an assistant professor. Of course, the department of data theory did psychometrics. There was some mathematical psychology going on initially, but soon all we did was psychometrics. But data theory can imply all sorts of data, right? Was there a reason why it was mainly psychometric? Well, Van de Geer was interested in multivariate analysis and on a level far removed from actual applications, although there were always actual applications around. But the emphasis was always on developing multivariate analysis techniques, and that was fine with me, so I went along with that. But you started off with psychology… As a major, yes. But you preferred psychometrics? After a year, I was not interested in psychology anymore. I didn’t like the idea that everything was, well, let’s say debatable, or uncertain, or up in the air, or whatever the appropriate term is. You preferred real knowledge. Well yes, “real knowledge,” though not empirical knowledge but mathematics basically. You mentioned that some people worked on applications at the data theory department; did you also still work on psychological research? There were psychologists who sometimes needed something, like scaling techniques or data analysis efforts, and they were clients of the department of data theory. 
They were helped with whatever they needed, either something that already existed or something new. But there wasn’t that much consulting; it was a somewhat isolated and luxurious department within the social sciences. We were not very often bothered by things which had to be done. There wasn’t any real teaching, there was some consulting but only minimal, so basically, we could spend all our hours on developing techniques.
And you did your Ph.D. there as well, right? Yes, I did my Ph.D. in 1973. I was invited to go to Bell Laboratories in New Jersey, and it was unwise to go to the USA without having a Ph.D. degree, because you couldn’t get an academic appointment without it. So, I finished the degree just before I went. The dissertation was called Canonical Analysis of Categorical Data. It’s more or less clear from the title what the dissertation was about. Well, can you still explain it for people like me, who don’t fully understand the title? It was the precursor of what was later known as the GIFI system, which is a large reorganization of descriptive multivariate analysis techniques in such a way that they apply both to numerical and multinumerical data. It covers regression, principal components analysis, and factor analysis; all those techniques are in one common framework and ultimately in a series of computer programs as well. My dissertation was basically the first programmatic statement of that program. The GIFI project itself ran until about 1990; I left in 1987, but there were still some loose ends, and the data theory group in Leiden, which still existed at the time, wrapped them up. They produced a book, the final version of the GIFI book.1 GIFI was one of the major topics in your career, right? Certainly, in the first part of my career, after I came back from Bell Laboratories, which was in 1974. We started developing the computer programs in 1974; at the time we were in the computing center of the University, on the Wassenaarseweg, and that grew into the GIFI project. We got large grants from NWO, the grant organization for sciences in the Netherlands, and we started hiring people. Ultimately, the GIFI group was about 10–15 people who were working on the postdoctoral course that we taught in 1980 and 1981 and on the book that eventually came out in 1990. The book was named after a certain person, if I remember correctly.
It was named after a butler, Francis Galton’s butler. Why Francis Galton’s butler? The reasoning behind that was that Francis Galton’s butler got a raw deal from Galton when Galton died. He served him for 25 years, he only got a couple of hundred pounds, and most of Galton’s fortune, which was considerable, went to establishing the first chair of statistics2; he donated a chair to University College London. He only gave his faithful Swiss butler 500 bucks. So we needed a pseudonym for our book, and we chose GIFI. It’s similar to the Meerling books. These were two books that were written by a collective of authors, mostly in the methodology department of psychology, but the data theory department was also involved, and they also adopted a pseudonym, “Meerling.”3 Now, Meerling was not a real person obviously; it just means “multiple persons” in Dutch. And you wanted to honor Galton’s butler. Yes. We also made it some sort of a riddle, because initially, we didn’t publish who GIFI actually was. There was a picture of him in the book but people still didn’t know who he was. It was an old picture of a Victorian type of person that was more mysterious than revealing who he actually was. And is your GIFI work still finding applications? Yes, and as far as I’m concerned, the project is still going on. We have a group which now consists of Patrick Mair, who is a professor at Harvard, and Patrick Groenen, who is a professor at Erasmus, and me, and we’re sort of doing a reorganization: a reworking of GIFI with new computer programs, new theory to some extent, and new algorithms. So GIFI’s still ongoing. It’s not a group of 25 people anymore, but it’s one of the things that I’m doing in retirement. You are officially retired, but you’re still working. I’m not getting paid anymore, or anything like that, I have my retirement money of course, but there are no more obligations, let me put it that way. I don’t have to teach anymore; I don’t have to go to meetings. You do it because you enjoy it. Yes!
1 Albert Gifi (1990). Nonlinear Multivariate Analysis (Eds. W. J. Heiser, J. J. Meulman, G. van der Berg). New York: Wiley.
2 Upon his death, Francis Galton donated his money to found a Chair in Eugenics at University College London, with the recommendation that Karl Pearson should hold this chair. Pearson founded the Department of Applied Statistics at UCL.
3 Meerling (1981). Methoden en technieken van psychologisch onderzoek, deel II. Data analyse en Psychometrie. Boom, Meppel. Meerling (1980). Methoden en technieken van psychologisch onderzoek, deel I. Model, observatie en beslissing. Boom, Meppel.
When you went to the Bell Labs, after your Ph.D., what work did you do there? Same type of work. One of the things that was going on at the time was the work in nonmetric multidimensional scaling, which was started by Shepard and Kruskal, who were at Bell Labs at the time. Shepard was there in 1962, Kruskal in 1964. That work received a lot of attention, and Van de Geer and his students, of which I was one, were developing various versions of these nonmetric multidimensional
scaling programs. Eddie Roskam was another student of Van de Geer who worked in the same area. I started writing up this work when the data theory department was formed in 1968. I put out enormous numbers of internal reports with nice red covers, and they caught the attention of the people at Bell Labs, because they were still working on multidimensional scaling. They invited me over to work on multidimensional scaling, nonmetric scaling, various multivariate analysis techniques, and I spent a year there, in New Jersey, with my family. Did you become assistant professor after? I was already assistant professor at the time. I’m not entirely sure about the time scale, but I think I probably became what was then called “lector” around 1975, and at some point, lectors were converted to full professors, in maybe 1976 or 1977. So I became professor, and then eventually, Van de Geer stepped back and I became the chair of the department of data theory, which was around 1980. We also started to organize these GIFI post-doctoral courses and to build out that project, until I left in 1987. And then you went to Los Angeles. I was invited by UCLA to build a statistics graduate program in the division of social sciences. There was some sort of initiative by the dean of social sciences to build out the quantitative components of the various social science departments, so they needed somebody in charge of that effort. It was an open application, but eventually they chose me, and I got an appointment partially in psychology and partially in mathematics, so I was professor of mathematics in psychology, starting in 1987. When I arrived there, it turned out that most of the faculty didn’t really want a division or a graduate program in statistics for the social sciences, but they wanted a department of statistics, because there wasn’t one at UCLA.
Then surreptitiously, maybe against the wishes of the dean of social sciences, we started our effort to build a statistics department at UCLA. That took about 11 years, but eventually in 1998, we got a department of statistics and I became the chair of that department until my retirement, which was 3 years ago. Did the statistics department also encourage psychometric research? I was president of the Psychometric Society in I think 1987, which is also when we had the Psychometric Society meeting at UCLA, but probably before that, data theory was already quite far removed from psychology. If anything, that department was organizationally speaking in the social sciences, and we had people working there from various social sciences, from mathematics, and from various other disciplines as well. Although it was nice to organize the Psychometric Society meeting, I had moved away from psychometrics proper quite some time before I became president of the Psychometric Society. The type of multivariate analysis work that I was doing at the time and still am doing typically originates with psychometricians though. It’s the same type of work that Henk Kiers, Jos ten Berge, Patrick Groenen,
and Willem Heiser are doing. So, in that sense I haven’t moved away from psychometrics that dramatically, but organizationally, all my efforts were to get a statistics department established, and as a consequence, my peer group and the people I had to talk to and the people I had to influence were statisticians. So consequently, I moved most of my efforts to statistics and I started websites about statistics, electronic textbooks for statistics, journals that were statistics journals, all to support this effort to get a department of statistics at UCLA. Did your research deal with other types of data than psychometric data? Sure, yes. Medical data, astronomy data, satellites, traffic data, bank data, credit cards, anything, name it. Because there was now a general statistics department for the whole campus and we had a consulting center established early on and we had clients from all over the university. Probably most clients came from the medical sciences, but they also came from astronomy, from chemistry, from all kinds of departments. I reckon Peter Bentler is also part of the department? Peter Bentler has a joint appointment in psychology and statistics, so he’s half psychology, half statistics. The statistics department at UCLA is not more focused on psychometrics than on statistics in general. The statistics department at UCLA is not in the social sciences, or in the behavioral sciences. Psychology is in the behavioral sciences, or the life sciences, and the social sciences are their own division in the college. Statistics is in the division of physical sciences, so organizationally we’re more related to astronomy, chemistry, physics, and mathematics, though clients come from all over the place. Of course, this has shifted over the years; everything has become much more computational and much more data-oriented than it used to be. Do you think that’s a good development? Oh yes, I think that’s a wonderful development! 
It’s basically, in a sense, a dream come true. I was hired over other people for that particular position because they knew that I emphasized computation, multivariate analysis and data analysis, and the computational side of statistics. And at the time, there was sort of a war going on between mathematical statistics and computational statistics. For the layman here, what exactly is the difference between the two? Mathematical statistics is basically a branch of applied mathematics, where your business is to prove theorems, and computational statistics is a branch of computation where your product is a technique, a computer program, and to analyze data. That battle raged for quite a long time. It’s over now; I mean, the computational people have won, which is a good development, and especially good for UCLA
statistics because we had that emphasis on computation from the start, even before it was popular, or before that war was over. When people ask you what your profession is… Statistician. Without a doubt. It’s also my title obviously, a professor of statistics. Some psychometricians affiliate strongly with psychology, or find it interesting, but you… I have basically not kept up or looked at any psychology journals since 1968 maybe, since I stopped taking psychology lectures and things like that. There’s no special tie between me and psychology. Obviously, if you’re a psychometrician, you get into contact with test theory and IRT, but I’m not sure if that’s psychology or not. I think people differ on that. I would imagine, yes. At least, it’s about psychological data. Yes, or educational data. For a couple of years, I was the editor of the Journal of Educational Statistics as well. That was a contact I had with the education people, I think this was around 1991, and I went to some education-oriented events. But I was mostly still in the company of statisticians. I’m still working with Patrick Mair, who’s a professor of statistics in psychology or in the behavioral sciences, and with Patrick Groenen who is an econometrician, who has had psychometric training (he is a data theory product), so there are still those connections. But I haven’t been to a Psychometric Society meeting since the one that I organized in 1987. Do you still read Psychometrika? I’ve had a subscription a couple of times, and I’ve generally renewed it when I realized that I had let it lapse. I don’t think I’m subscribed at the moment. I usually look at the table of contents, and if there’s something written by one of my old friends, I probably look at it a little bit more, but it’s not what I normally read. If you look at the bookcase, it’s all mathematics, computation, not even statistics actually; it’s all matrices, optimization, and mathematics. There’s no psychometrics on there.
Is mathematics your true passion? I think my true passion is programming or computation. I’m interested in pure mathematics in the same sense as I’m interested in art: not as a profession, but as
somebody who observes it and enjoys it, like poetry. The actual work is in the programming, in the computation. Should psychology become more statistics oriented? Especially these days, that’s a very loaded question. It’s always been a loaded question, but now, there’s this replication crisis of course, and the replication crisis has as its main message that the statistical methods used in psychology are bad, bogus, not very conducive to building up a cumulative science. It’s something that I have written on in the past, quite a long time ago,4 but it’s now coming more to the forefront because of the extrasensory perception experiments and things like that. I think, as a sort of general observation, as somebody who is not really involved in the debate, that it’s probably necessary that psychology thinks very hard about the statistical methods it uses at the moment and changes something in their paradigms. I’m not entirely sure what that should be because, again, I don’t really follow the debate except at the level of Science or Nature, or those more general journals that write on this crisis. But regardless of the reproducibility problem, you think that statistics is a good framework for psychological research? Basically, any discipline that collects data will benefit from what used to be called statistical techniques and is now often called data analysis techniques or data science techniques. There are all these new names that are mostly bogus because they’re just there to make money, to get grants and things like that, but the idea is still the same. You have data, you want to present them in such a way that they convince people of something that is either true or not true, and you want to do that in a way which conveys as much information as possible and is still readable and convincing at the same time. 
If that is statistics, and I would like to think that it is, then every science that collects data (so not necessarily philosophy) will benefit from these techniques.
4 De Leeuw, J. (1994). Statistics and the sciences. In I. Borg & P. P. Mohler (Eds.), Trends and perspectives in empirical social science (pp. 131–148). New York: Walter de Gruyter.
Is it the job of psychometrics to improve the statistics in psychology? I would think so, yes. Ever since the standard paradigm for experimental psychology was established, which entails things like analysis of variance and t-tests, there has been pretty heavy criticism, such as the critique on significance levels, which generally has been ignored. It’s now coming back to bite them in the posterior. Psychometrics is not different from chemometrics, econometrics, and other metrics; many disciplines have a similar type of activity as psychometrics is for education and psychology. Generally, their job is to teach economists or psychologists or chemists how to handle data, how to analyze data, and obviously, how to do it well, because that’s the whole idea of teaching something. So yes, if psychology needs
better statistical techniques and all the indications are that they do, then psychometricians should play a role in teaching them how to do that. Should psychometrics also play a role in actually building psychological theory, or should that be up to the psychologist? I think that should be up to the psychologist. There’s something like mathematical psychology, which is, if I understand it correctly, not in a very good state, but it used to be seen as the summit of scientific psychology. So mathematical psychology is an actual psychological discipline and doesn’t have anything to do with psychometrics, except for the fact that they both use mathematics. But so does physics; using mathematics doesn’t say anything in itself. So, no, I don’t think psychometricians should be involved in developing theories. Of course, a particular person can be both a psychologist and a psychometrician; there’s no law against that. Someone could spend half of his time developing psychological theory and half of his time developing statistical techniques to analyze psychological data. But as its main purpose, psychometrics doesn’t have to be involved in theory building. No, it doesn’t; its definition doesn’t involve developing psychological theory. Does the “psycho” part in psychometricians just relate to the type of data that psychometricians deal with? No, it refers to the type of clients psychometricians have, which is often the same thing as the type of data they use, obviously. Tests are the obvious type of data, for example, attitude scales, but I’m guessing they also deal with data that come from experimental designs. Is that true? I think test data is probably what they mostly use. There has always been a tension between the type of statistics used by experimental psychologists and the type of statistics that psychometrics is producing, and that’s reflected in journal acceptance policies and in a lot of polemics over the years. 
Eventually it’s all about analyzing data, and in the right sense, it also involves collecting data and setting up experiments, but everything that has to do with data should be within the scope of psychometrics. There’s very little about this experimental approach to psychology in Psychometrika, for instance, but that’s just for historical reasons, I guess. Psychometrics should, or could, spread its wings a little bit more. Yes, I think so, especially to alleviate the current crisis maybe. But I won’t be involved in that.
Do you think psychometrics could play a role in other fields as well? Psychometrics has a very problematic relationship with statistics. The first generation of big names in psychometrics mostly concentrated on factor analysis, and factor analysis did not have a good name in statistics for a very long time, for obvious reasons. It’s a somewhat strange technique, and people wrote about it as if it was sent directly from God. There were all kinds of strange practices and weird computations, so people who should’ve helped improve it instead decided to sit on the sidelines and criticize it, and that didn’t really help. And then there are a lot of things published in psychometrics that have been done better elsewhere, and there are a lot of things published in psychometrics that actually are original and up to date and innovative and ahead of their time, in the sense that other techniques or other sciences developed them later. Generally, if the other sciences developed it later, they ignored things that were going on in psychometrics 10–15 years earlier. So, it’s an insulated field. That’s a bit of a problem; it’s insulated from psychology to a large extent because psychology is dominated by the experimental tradition and not by psychometrics, and it’s insulated from statistics because it has not really tried to penetrate the much larger field that is statistics. If psychometricians had tried to publish more in regular statistical journals, they would have prevented some of these isolation problems, and it would’ve been possible I think as well. It would’ve been hard because there’s always this protective shield around each discipline, but it would have been possible. And to some extent, it’s happening more and more: there’s more overlap between fields now than there used to be. Do you have an example of something that was published first in psychometrics and then was picked up years later somewhere else? There are many examples. Factor analysis is maybe the first example.
It was often presented in a non-mathematical way, a non-rigorous way, but eventually it found its way into statistics and was transformed in the process into something more respectable and more accepted. But there are many other examples. Non-metric multidimensional scaling or multidimensional scaling in general started in psychometrics, test theory started in psychometrics, three-way data analysis started in psychometrics, and many of the results of that are still continually being rediscovered in chemometrics. I’m not a statistician, so can you explain what makes factor analysis such a strange analysis for a statistician? It was mostly the way the original factor analysts, who were psychologists, like Spearman and Cattell, presented it as some magical tool that could discover laws of nature by simple inductive data analysis. The other thing was that the actual procedures, the actual computations that were done in factor analysis, were, from a statistical point of view, fairly primitive. That’s exactly where statisticians could’ve jumped in and improved the procedures, but they didn’t do that until the 1940s and the 1950s, and then it was still an isolated example. It didn’t really happen until
Jöreskog’s work, which was in the 1970s. So the reasons why the technique was not accepted were partly the claims that people made, partly the philosophy of science point of view (the idea that you discover these deep underlying constructs just by computation), and partly the actual quality of the work from a mathematical point of view. Part of it was probably also because they concentrated on intelligence, and that has always been a problematic concept. You have not worked on intelligence? I have some publications about the IQ debate as it was called, together with Jos Jaspers, a professor of social psychology in Leiden at the time.5 I’ve also written something on psychometrical genetics.6 At the time, I was pretty interested in what was then known as the IQ debate; I’m not entirely sure if that’s still the name of it, but I published some papers on that topic, and then of course, when I was editing the Journal of Educational Statistics, I had to be interested in test theory and similar things. I published quite a bit on education as well, but always with a slant of data analysis and multivariate analysis. What do you consider your biggest achievement? What you’ll be remembered for? I don’t know how long I’ll be remembered! Who knows. What are you most proud of? Perhaps that’s a better question. The UCLA department of statistics is probably the most permanent thing that I helped establish and I was quite instrumental in having it established. It’s interesting, because it started small obviously, and it started slowly because there was a lot of opposition. Then you see it growing, and now there are about 300–400 people involved, students and faculty and staff, so I think that’s a large accomplishment. I obviously didn’t do it alone, but I think I made a substantial contribution to it.
From the scientific point of view, I did some fundamental work in multidimensional scaling (I’m still doing fundamental work), and there’s the whole GIFI system, which I’ve been working on since 1968. That’s a large amount of work which has produced a huge number of students, who did their dissertations with me or with other people after I left, so that’s a continuing tradition.
5 Jaspers, J. M. F., & De Leeuw, J. (1980). Genetic-Environment Covariation in Human Behavior Genetics. In L. J. Th. Van der Kamp (Ed.). Psychometrics for Educational Debates (pp. 37–73). New York: Wiley.
6 De Leeuw, J. (1982). Psychometrische Genetica [Psychometric Genetics]. In H. C. J. Duijker & P. A. Vroon (Eds.), Codex Psychologicus (pp. 287–297). Amsterdam: Elsevier.
Is there a psychometrician or statistician who has really inspired your work? Van de Geer was the first one to inspire me in that sense, and his geometrical approach to data analysis was different from what I eventually did, but it was
inspiring anyway. In terms of direct influences, it was also the Bell Laboratory people, like Kruskal and Carroll, and more distantly, Louis Guttman, who has an enormous body of work stretching over a long period of time, from the 1930s to the 1970s and 1980s, of very good quality. I always enjoyed reading his work. Later on, he went a little bit off the rails. When did he go off the rails? In the 1980s, I guess. That happens to a lot of people who are part of these “schools” and have an enormous amount of influence on their students and insulate themselves. It happened to Guttman, it happened to Benzécri, it happened to Kalman, it happened to Herman Wold, and that’s not a very good development. I hope it didn’t happen to me! We’ll find out! What do you believe is the most important work ever written in psychometrics, historically speaking? I think that the general idea of latent structure analysis, with IRT and factor analysis as its special cases, is the most important idea to come out of psychometrics. It’s not too strange to maintain that it originated in psychometrics. There was no psychometrics at the time, but it was developed by people who were doing what we now call psychometrics, and it has been developed to a large extent in psychometrics until it became a respectable and popular method in other disciplines as well. What is still the biggest hurdle, or the biggest challenge, for psychometrics? That would probably be the training or education that future psychometricians receive. I’m not entirely sure about the situation at other universities, but I know that at UCLA, there’s some psychometric teaching and education, but Bengt Muthén can probably tell you more about that. And there’s some training in psychology and Peter Bentler can tell you more about that, but it’s fairly minimal. I also think that probably there’s not enough contact, even now, between psychometrics centers and official academic statistics.
Obviously, we try to do something about that, but it has only been partially successful. It’s easy to make joint appointments but it’s more difficult to integrate teaching activities in different departments. I think it’s probably a challenge to keep things going in psychometrics. It’s not a big field, it doesn’t have a big journal, it’s not a big society; it’s marginal. Psychometrics should be much larger I think, if they capitalize on the fact that there’s now data all over the place; the whole world is filled with data. So psychometrics has the potential to be bigger. I think so yes.
But it hasn’t happened yet. No, but as I said, I haven’t really been involved in the Psychometric Society for quite a long time. Basically, I may not be up to date. I think many people share the idea that the Psychometric Society could be bigger if there would be more exchanging of knowledge. Yes. So, you’re retired, but you’re still doing work. I’m producing more than I have for 30 years, 40 years. I’m working with the two Patricks, Patrick Groenen and Patrick Mair, on this new version of the GIFI project, so that’s one thing. With the same two Patricks, I’m also working on new multidimensional scaling programs. And everything we do these days consists of publications, many of them electronic and open source. We’re working on the programs in R, that’s a relatively new way of making things, which is very convenient for me. Since I’m retired, I don’t really care whether something gets published or not. If people want to see something I wrote, they know where to find it; they can Google me or look at my website. That’s probably a suboptimal way of producing things if you need tenure somewhere, but it’s quick. It’s perfect for you. It’s perfect for me and in a sense it’s funny, because that’s exactly the way I wrote things in 1968 when I started producing these never-ending series of internal reports at the data theory department, and I’m now producing a never-ending series of internal reports right here in Portland. The circle comes back around!
Chapter 6
Bengt Muthén
“I think psychometricians have stayed on the sidelines of the statistical mainstream.”
Bengt Muthén, emeritus professor at UCLA Graduate School of Education & Information Studies, was president of the Psychometric Society in 1988. Muthén finished his Ph.D. under the supervision of Karl Jöreskog in 1977 at Uppsala University. His research interests are latent variable modeling, time series analysis, and analysis of categorical data, among others. He is the co-designer of Mplus and is now fully dedicated to its further development.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_6
How did you end up in psychometrics? That’s a good question. Maybe it’s accidental actually. I was at the statistics department at Uppsala University, and I didn’t quite know if I was going to go further in statistics, but one day I heard that this incredible guy, Karl Jöreskog, had returned from the Educational Testing Service (ETS) after about 10 years there, and he had supposedly ten dissertation ideas in his suitcase. I was lucky enough to become one of the six or seven in his first cohort of graduate students. He was really inspiring of course; he could teach you really clearly and then bring you up to speed, bring you up to the research frontier very easily and tell you what needed to be done to go beyond that. His applications were mostly psychometric in nature and mostly on US datasets. We had very few Swedish data to work with. So, psychometrics and psychological applications became what all of us in that group worked with. That’s how it started. So, he brought the applications from ETS? Right. His dissertation was on factor analysis estimation methods,1 and after his Ph.D., he got the opportunity to go over to ETS and perfect it more. Before you worked with Karl Jöreskog, you did a master’s degree in statistics. What was your specialization? Well, it wasn’t really a master’s degree, but there was a paper I wrote along the way on analyzing longitudinal data from a medical hospital in Uppsala. The paper was actually on health data on seven dogs measured repeatedly, so there was a biomedical slant to the whole thing. And that stimulated me to do what all of my mentors did: writing a computer program. I wrote a maximum likelihood computer program and took into account the autocorrelation across time. I haven’t come back to that till pretty much now, that is, time series analysis, which is what we’re focusing on in a new version of Mplus. 
You mentioned before the interview that you’re still working on your computer program Mplus; is that program derived from the one you were working on for the time series data? It was not derived from that program, but the interest in writing computer programs was always there, from the beginning, I think. My mentors were Karl Jöreskog and Anders Christoffersson, and whatever statistical idea they had, they eventually translated it into a computer program. I got inspired by that and did the same.
1 Jöreskog, K. G. (1963). Statistical estimation in factor analysis: A new technique and its foundation. Doctoral dissertation. Uppsala University, Department of Statistics.
Christoffersson was your second advisor? He had done a lot of work on econometrics, but I was more inspired by his factor analysis work, specifically factor analysis of dichotomous variables. He had a paper in Psychometrika2 and developed an estimation method. I came up with another method and published an article about it in Psychometrika in 1978.3 That article brought me to the computer programming, to factor analysis, and to categorical variables, which became main themes for the future. I graduated in 1977, and the dissertation was a series of five papers, which were on categorical variable modeling, factor analysis, and structural equation modeling with categorical variables, particularly dichotomous variables. Was factor analysis a theme throughout your entire research career, or was that a temporary project? It turned out to be a long-term theme, because I worked on that for probably 10 years. But after that, I got inspired by other things and expanded it: my research went from categorical variable modeling to multilevel modeling and then to mixture modeling, which is about categorical latent variables rather than categorical observed variables. There is quite a fundamental distinction there. So you have a Ph.D. in statistics, but these topics sound more like psychometrics, right? Yes, they could be characterized as psychometrics, I think. Was it characterized as such at the time? We thought of it more as applied statistics. There were two departments at Uppsala University: one was a mathematical statistics department, and the other was a statistics department. The latter was where I was, which was more applied, though still quite technical. I see it as pretty much what the psychometric training is like in Europe, say in the Netherlands: heavily statistically oriented, but still always with an application in psychology. Did you actually do any psychology during your degree? I have never studied psychology at all. 
Actually, my way into statistics was via sociology. I didn’t know what I was interested in, so my first topic at the university was sociology, and the only thing I really liked that had some substance to it, I thought, was statistics.
2 Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40, 5–32.
3 Muthén, B. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551–560.
What about sociology appealed to you in the first place? It was the study of people in society and I thought that was interesting. I guess I had some interest in people. But anything that I know about psychology I picked up from the various statistical applications and articles I’ve written. Do you consider yourself a psychologist now? No, no. Do you still have an interest in it? Yes, I think psychology generates a lot of interesting analysis problems, just like medicine or the biometric applications I did in Uppsala. Here at UCLA, I was at the school of education and education generates interesting questions too, particularly multilevel questions. But I would say, psychology is probably the application area that generated most of my analyses in my research papers. So, after you wrote your dissertation in Uppsala, you left for UCLA. In 1977 I graduated, and stayed in Uppsala on research projects for about 2 years. I worked on a grant, with Jöreskog and Sörbom, and within that grant I took a trip to research institutions in the USA for 3 months, from Abt Associates in Boston to Northwestern near Chicago, to Stanford, and finally down to UCLA. At UCLA I gave a guest lecture on my dissertation, and there was a student in the class whom you saw earlier today, when you arrived here. Sorry, for a moment I was a little confused, but you mean your wife, Linda! I came to the USA to look for research ideas, but the hidden agenda was that I looked for places where I would want to work. Was there a future for you at Uppsala? I was interested in becoming a professor, but a professor in Sweden particularly at that time was more of an administrative position than a research position, and that was not my interest or skill area. I probably would not have gotten an interesting job in Sweden at that point. That was one of the reasons I came here, and the other part was the reason I stayed here. Jöreskog stayed in Uppsala. 
He did; he went back to Uppsala, became full professor there, and led our research group that we all enjoyed. I think he enjoyed it too, but as the years went on, I think he became burdened by the administrative part of the job. But he stayed, and so did Dag Sörbom.
Is it still a flourishing statistics department? It’s certainly moved away from the direction that it had back then, when Jöreskog and structural equation modeling were at the center of the department. It has gone back to its tradition of time series analysis and econometric analysis, but more recently I think it has broadened a little bit. In 1980, I came to the USA for the first time, I came back in 1981, and looked for a job and a position for a methodologist opened up at the graduate school of education at UCLA. They had a research methods division within that school. Before that, I stayed at Peter Bentler’s psychology department. From 1981 to 1982, I worked on a grant written for the institute of justice—they had a methodology program—and Peter helped me channel it through to the psychology department. And in 1982, I got a professorship in the school of education. That’s where I stayed for 25 years, until I took early retirement at 59, and started working only on our computer program Mplus.4 And that’s perfect for me. The research that I’d been working on throughout my whole career fit naturally into what we wanted to do with the program. So developing that program was a chance to continue doing my research. And that’s what you’re still working on now. Yes, I’m working fulltime on the program, together with Linda and two great programmers. We also have other great personnel, including our daughter. One big happy family! Linda and I just finished a book together and version 8 of Mplus. It’s a very thick user’s guide, and in the summer, I’m actually going to teach on it in Utrecht. There will be a one-day workshop and an Mplus users meeting. That’s on July 15, and then Monday the pre-conference workshop at the Psychometric Society meeting in Zurich, and then there is a three-day workshop on Mplus at the Johns Hopkins University in Baltimore. So there it is, from beginning to end. I’m not very familiar with Mplus. What exactly is it designed for? 
It has the theme of latent variable modeling, which has always been my interest, from factor analysis to multilevel analysis with random effects, which are also continuous latent variables, to mixture modeling with categorical latent variables; it’s specialized in all of those things. At the time, it was an interesting situation: I was working at UCLA and had grants from the National Institute on Alcohol Abuse and Alcoholism, NIAAA, but then we heard about the Small Business Innovation Research grant program, which could give much more funding than you could get at the university. That program wanted to fund a marriage between business and
4 Muthén, L., & Muthén, B. (1998–2017). Mplus User’s Guide (8th ed.). Los Angeles, CA: Muthén & Muthén.
academia that would be able to quickly put research into practice, and they wanted the business to be self-sustaining after a couple of years. Linda became the businessperson and I became the academic person, and thanks to the grants we were able to open up an office here at the Westside, and hire three fulltime programmers. This happened in 1995, and we released the first version in 1998. That would not have happened within Sweden’s mechanisms; only a big country can support such a project. The funny thing was that initially, we just wanted to do a little expansion of a program that I had started during my Ph.D., called LISCOMP.5 Darrell Bock’s statistical software in Chicago was distributing LISCOMP, but when we released it, it became clear that we should combine structural equation modeling and mixture modeling into one program. That was the first time these two very diverse areas came together. This all happened in 1998, and we had no idea if Mplus was going to be popular or not, but it became quite popular. And now 18 years later, we’re still going strong. Do you have any idea how many people use it? Lots! Maybe 50,000? That’s pretty big, especially in the world of science. Yes, so it’s used across the world. Psychology is probably the major field, but it’s also used by marketing research and medicine. Have you collaborated with applied psychologists or applied researchers in your career? There have been collaborations with people in education, psychiatry, and psychology. For instance, we wrote a paper on dropout modeling in longitudinal data, with a trial on antidepressants.6 Basically, you have measures of depression over time of people in a placebo group and people in a treatment group. However, the attrition over time is not random: participants drop out because of the ineffectiveness of the medicine, or for some other reason. 
We tried to model the dropout, and that became another example of latent variable modeling where you go beyond the missing at random assumption, MAR, to actually model the missing data mechanism at the same time as you model the mechanism you’re interested in. That was a collaboration with two depression researchers in psychiatry at UCLA. They came up with the kinds of research questions and the data, and I taught them what the statistical analysis meant and we wrote a paper together for Psychological Methods.
5 Muthén, B. (1987). LISCOMP: Analysis of linear structural equations with a comprehensive measurement model. Mooresville, IN: Scientific Software.
6 Muthén, B., Asparouhov, T., Hunter, A. M., & Leuchter, A. F. (2011). Growth modeling with nonignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17–33.
The latent variable modeling tradition is something originally psychological or psychometric, but you’d say it has purpose outside of psychology as well? I think it’s actually unfortunate that psychology and psychometrics don’t get enough credit for latent variable modeling. I think it’s a strong tendency in statistical journals to refer to early statistical articles that used latent variable modeling, but it is very seldom that you see articles referring to the psychometric literature, like Spearman’s and Thurstone’s work, or anything that came after. Psychometrics should work harder on getting cited in the statistical literature. Psychometric publishing seems to be too separated from general mainstream statistical modeling, but I think one really good inroad is latent variable modeling. Statistics is not that fond of factor analysis and structural equation modeling; statisticians think of that as hocus pocus machinations. But if you present it as latent variable dimension reduction thinking, then it’s similar to what the statisticians write about in biometrics, for instance. But they don’t refer to the psychometric literature, which always irritates me. I’ve heard before that a statistician looks down on latent variable modeling, because of the unobservability of the construct. They wonder where this theta comes from. I think that has changed in the last decade. If you look at JASA or Biometrics, the flagship journals in statistics, you’ll find latent variables mentioned and used all the time. Even in dropout modeling, researchers use them all the time but they don’t present them the same way. They might not like exploratory factor analysis; what the statisticians do with random effects modeling falls in the realm of confirmatory factor analysis, to use Jöreskog’s terms. But they don’t even talk about factor analysis. Very seldom they do, and when they do, they don’t refer to psychometrics. So they call it something else? They call it latent variable modeling. 
To stay away from the factors. If you go to a statistics meeting, say latent variable modeling, don’t say structural equation modeling or factor analysis. It’s an interesting relationship, right. Yes. I’ve been criticized for doing structural equation modeling by statisticians and for some reason, they don’t like it. I just see it as very flexible latent variable modeling.
Many applied psychological researchers use SEM to construct a model that represents some kind of theory about whatever they study, and they assign a certain meaning to those factors. Is that also something that statisticians have problems with? I don’t think so, not in principle. It’s very much about how you present things. I guess the clearest applications are in biometrics. Whenever you have heterogeneity, biometricians want to use a random effect, a continuous latent variable, a factor, but it has a very specific, very well-determined influence on the observed variables. But I don’t consider the difference as important; they have their own tradition, and they have not traditionally studied psychometrics. Where did psychometrics go wrong? Where did we miss an opportunity? I think psychometrics isolated itself too much. Psychometricians should publish more in statistical journals, so not only in Psychometrika but also in the Journal of the Royal Statistical Society or Biometrika or JASA, so that people will find references to psychometrics in the statistical journals. And sometimes I think psychometrics becomes a little too involved in specific areas. They have a heavy emphasis on item response theory, and you don’t see item response theory mentioned in the mainstream statistical areas. It’s maybe mentioned in the Bayesian context—there have been some authors using it—but I think psychometricians have stayed on the sidelines of the statistical mainstream. So, you consider psychometrics to be too specialized. Do you think that IRT is still important or would you say “been there done that,” let’s carry on to other types of analysis? When I’m not diplomatic, I say they’ve been tinkering on the margin for too long! I see IRT as factor analysis of categorical variables. That’s a typical situation, and in the past, it was even just one factor. How many articles can you write about that? A lot actually. Yes! 
I realize of course that there are very specific applications, like high stakes testing, that need to build on very precise foundations. But from the outside, it sounds like a preoccupation. I have disliked that it has taken over so much of psychometrics, because there are so many problems coming up in psychology that deserve space in Psychometrika. Peter Molenaar’s paper in Psychometrika in 19857 was very important, and it seems a fruitful area for a meeting between statisticians and psychologists. My group is writing multilevel time series analysis now. In these models, the second level is individuals, so that the parameters that guide development over time for an individual can vary across individuals, and I wonder if the
7 Molenaar, P. C. M. (1985). A dynamic factor model for the analysis of multivariate time series. Psychometrika, 50, 181–202.
technology that we have for multilevel time series analysis would be of interest to statisticians. We should come back and feedback. Psychologists would be interested in how autocorrelations vary across people, but I’m still searching for papers with a non-psychological application that take an interest in individual variation in autocorrelation or variability. I don’t know enough about that topic to say for sure, but I like the scenario where psychology asks a lot of interesting questions that then motivate psychometricians to develop something that could be of interest also for mainstream statistics. Do you think there is a trend that points toward the within person analysis? I think so. I’m not informed broadly enough to say but I have a feeling it is. We now have all these techniques for collecting data, like smartphones. We have a research group, a prevention science methodology group that has weekly conference calls, and Naihua Duan, who was at the RAND corporation here in Santa Monica and then Columbia University, gave a talk about N = 1 trials, so, one-person randomized trials. In these trials, a person chooses two treatments and he stays on one treatment for 2 weeks and then shifts to another treatment for another 2 weeks and then comes back. From those kinds of one-person time series data, you can see what treatment effects are. And that should also be considered psychometrics. I think so. So you don’t consider psychometrics to be only related to testing data? No, and that’s the thing. There are so many other interesting questions that come up in psychology that psychometrics could focus on. And is there something specific that psychometrics can offer to other sciences? I definitely think so. The example would be multilevel time series analysis, with all the different random effects for almost every parameter in the model. That could be of interest for marketing research, as well as finance or medicine research. That’s my hunch at least. 
As a psychometrician you sometimes feel that when you build something, you’re not sure how much interest there is. You hope that when you build a model, others will come. Psychometrics can build good things, but of course, you have to get that first inspiration from some substantive question. Do you think research always starts with a practical problem which you try to solve? Often, yes, especially when breaking into a new area. But then of course, the statistics has a way of feeding into itself. You come up with one method and you use it a while and you realize you need to also have another feature here and another feature there. The inspiration then comes from the statistical machinery being too limited. But it starts with a substantive question, yes. It starts with, say, psychology.
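The idea of individual variation in autocorrelation mentioned in this answer can be illustrated with a small simulation. The sketch below is not the multilevel time series method implemented in Mplus; it is a naive two-step illustration under assumed values (50 persons, 500 time points each, person-specific AR(1) coefficients drawn from a normal distribution with mean 0.3 and standard deviation 0.2). All function names and numbers are hypothetical.

```python
import random

def simulate_ar1(phi, n, rng, sigma=1.0):
    """Simulate n observations from a mean-zero AR(1): y_t = phi * y_{t-1} + e_t."""
    y = [rng.gauss(0.0, sigma)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y

def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation of a single person's series."""
    mean = sum(y) / len(y)
    num = sum((a - mean) * (b - mean) for a, b in zip(y[:-1], y[1:]))
    den = sum((a - mean) ** 2 for a in y)
    return num / den

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Level 2 (persons): each person has their own autocorrelation,
# drawn from an assumed population distribution and clamped to stay stationary.
true_phis = [min(0.8, max(-0.8, rng.gauss(0.3, 0.2))) for _ in range(50)]

# Level 1 (time points within person): simulate a series per person,
# then estimate that person's autocorrelation from their own data only.
estimates = [lag1_autocorrelation(simulate_ar1(phi, 500, rng)) for phi in true_phis]

mean_abs_error = sum(abs(e - p) for e, p in zip(estimates, true_phis)) / len(true_phis)
print("mean absolute error of per-person estimates:", round(mean_abs_error, 3))
```

A genuinely multilevel analysis would estimate the person-specific coefficients and their population distribution jointly, which matters most when each person contributes only a short series. The per-person estimates here only show what "autocorrelation as a random effect" means.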
Would you say you’re not the person who locks himself up in his room and just starts working on some kind of theoretical problem? I’m not a theoretical statistician, so no, I wouldn’t be that person. I think it’s true to say that very seldom I would write a statistical paper based on another statistical paper. Do you think psychometrics can become a field that earns a bit more respect from, for example, statistics? I think so, absolutely. It just has to branch out a little bit. Fingers crossed! In terms of your own career, would you consider Mplus as your biggest contribution? Speaking about long-term work, I guess it will be. I’ve worked in many different areas. When I started categorical variable modeling, my hesitation was whether it really makes a difference to do it the right way? Or was it also possible to do ordinary least squares assuming that the dependent variable is continuous even if it is zero-one? I found myself going from area to area, hoping to contribute something that actually made a difference. That’s why I worked on not only categorical factor analysis, structural equation modeling or multilevel analysis, but also on mixture modeling. There’s a body of papers that I’m happy about, although sometimes I feel that I spent too much time working on them. Probably from a practical point of view, Mplus would be what I’m most connected with, even though I’m just one of a team. Is there something you’ve done that you think is greatly overlooked? A paper you actually think was very important but that did not get the audience you may have wanted it to? That’s an interesting question. I actually do. 
I wrote a paper in Psychological Methods on Bayesian Structural Equation Modeling; I called it “A more flexible representation of substantive theory.”8 When thinking of Bayesian statistics, people are mostly worried about using priors, and I tried to point out that, for instance, in confirmatory factor analysis, you’re already using very strong priors when using maximum likelihood. I tried to come up with the notion of thinking of approximate zero, so that the prior, the distribution for that parameter, which has mean zero but is not exactly zero, doesn’t have variance zero but a small variance. That paper was about what we call BSEM, Bayesian SEM. One of the reviewers thought that that was going to be a big breakthrough, but I haven’t noticed it, not yet at least.
8 Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335.
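The notion of an "approximate zero" described in this answer can be written out as a short worked formula. The block below is a sketch for illustration: the symbol \lambda_{jk} (the loading of item j on a factor k it is not supposed to measure) and the prior variance of 0.01 are assumptions chosen for the example, and the interval applies to standardized variables.

```latex
% Conventional confirmatory factor analysis: the cross-loading is fixed at
% exactly zero, which amounts to a prior with mean zero and variance zero.
\[
  \lambda_{jk} = 0
\]
% Approximate zero: the prior keeps mean zero but has a small, nonzero variance.
% The value 0.01 is an illustrative assumption.
\[
  \lambda_{jk} \sim N(0,\ \sigma^2), \qquad \sigma^2 = 0.01
\]
% Implied 95% prior interval for the cross-loading:
\[
  0 \pm 1.96\sqrt{0.01} = \pm 0.196 \approx \pm 0.2
\]
```

Under this assumed prior, the cross-loading is expected to be small but is not forced to vanish, so the data can still pull it away from zero if the strict zero restriction is wrong.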
Did you continue working on BSEM? I have actually. The group at Utrecht University, with Rens van der Schoot and Herbert Hoijtink, have come up with a variation on that testing theme. So many have taken an interest in it, but it’s not caught on broadly. I’m naive enough to think that it should catch on more broadly than now at least. You’ve had the opportunity in your life to have a very famous advisor, probably, one of the most famous psychometricians in the second half of the twentieth century. What did he teach you? What was his major contribution to your work? Focusing on trying to do something that was really new and big, rather than making small adjustments to what already existed. Breaking new ground. Through his clarity of teaching, I felt he really knew where the research frontier was and where the gaps were. He wasn’t impressed by papers that only made little adjustments to what was already done before. He wanted to do big things. Well, he managed. He managed well, yes. Are there others who have really inspired your work? I mentioned these prevention science methodology weekly conference calls. Fifteen to twenty years ago, I met statistician Hendricks Brown, who was at the Johns Hopkins University at the time. He later did much applied work and became interested in prevention studies, and I became more interested in statistical studies, so we crossed in that way, but he arranged many meetings in the USA at the prevention centers where we sometimes met. At these meetings, we would hear about their work and data analysis issues, and he had a very clear way of seeing the central statistical features in their applied problems. Once he explained the problems he experienced with the data analysis, I got inspired to work on them. So, there were several occasions when what he said led me to have insights about what methods should be used. 
It happened, for instance, in the area of growth mixture modeling and also in noncompliance modeling, when you have people invited for treatment and not everybody shows up. How do you model the treatment effect under noncompliance? It’s a latent class problem. So yes, he was inspiring. Brown is now at Northwestern University. And he appreciates latent variable modeling! In terms of the history of psychometrics, what do you consider the most important work ever written? That I have a hard time saying. It’s probably Spearman or Thurstone, but I have to confess I haven’t read either one of them. So, I don’t know. Fred Lord and Gulliksen, they’ve also written important things. Well, I’ve read Lord and Novick. I couldn’t say really what’s the most important book, historically.
And looking back at more than a century of psychometrics, what do you think is psychometrics’ biggest achievement? Trying to do good measurement and developing factor analysis for analyzing such measurement. I think in terms of a method, it has to be factor analysis. Rather than IRT. Yes. Because you consider IRT to be a special case of factor analysis. Yes. Certainly, IRT has more profound implications than factor analysis. For instance, at the NAEP, the National Assessment of Educational Progress, they have what I think of as a MIMIC model: different student groups taking different kinds of tests, but since you have very few tests measured on every individual, you have to have a whole set of background variables. I think of this as X, and this is Y, and what you want to estimate is the factor in between here. They do that by IRT on the Y part and multiple imputation for the factor and then translate it into something that gets published in the newspaper: for example, whether a student is on a proficient or basic level. And that is of course an achievement—a nice blending of psychometric IRT or factor analysis knowledge and multiple imputation knowledge, which comes from mainstream statistics like Don Rubin’s work.9 I don’t know how to weigh the influence on practice by IRT against the influence on practice by factor analysis. Certainly, I would think there are more factor analysis studies in the world than IRT studies, at least. It would be an interesting question actually, to figure that out. I think more high-stakes decisions that influence policy come from IRT than from factor analysis, but I see factor analysis as the basic idea. Are there psychometricians who really disagree with you about this? I would think so. I’m probably an outlier, but I don’t know for sure. Do you have these discussions with people? Well, at psychometric meetings maybe now and then I moan about that there’s too much IRT and some agree with me, some who are more statistics oriented.
9 Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
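The view expressed in this chapter that IRT can be seen as factor analysis of categorical variables can be made concrete for the simplest case. The derivation below is a standard textbook sketch, not taken from the interview: it assumes one standard normal factor \eta, a latent response y_j^* behind each dichotomous item, and normal residuals scaled so that y_j^* has unit variance. The symbols \lambda_j, \tau_j, a_j and b_j are notation introduced here.

```latex
% One-factor model for the latent response behind dichotomous item j
\[
  y_j^* = \lambda_j \eta + \varepsilon_j, \qquad
  \eta \sim N(0, 1), \qquad
  \varepsilon_j \sim N(0,\ 1 - \lambda_j^2)
\]
% The observed item is 1 when the latent response exceeds a threshold
\[
  y_j = 1 \iff y_j^* > \tau_j
\]
% Conditional response probability: a two-parameter normal ogive IRT model
\[
  P(y_j = 1 \mid \eta)
  = \Phi\!\left( \frac{\lambda_j \eta - \tau_j}{\sqrt{1 - \lambda_j^2}} \right)
  = \Phi\bigl( a_j (\eta - b_j) \bigr),
  \qquad
  a_j = \frac{\lambda_j}{\sqrt{1 - \lambda_j^2}}, \quad
  b_j = \frac{\tau_j}{\lambda_j}
\]
```

Under these assumptions the factor loading and threshold map one-to-one onto the item discrimination a_j and difficulty b_j, which is the sense in which the one-factor IRT model is a factor model for categorical items.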
I’ve heard more about these kinds of issues from other presidents. Some indeed are more IRT oriented and others think that IMPS should open up more. Should the Psychometric Society also encourage not only latent variable modeling techniques but also other techniques that are useful for psychological data? I think so yes. I think that’s how Psychometrika could become more widely read. Even for brain data, for example? There was a special issue in Psychometrika about fMRI. I should take a look at that again actually. I think Psychometrika should be branching out to new data areas; it would be great. I guess the basic theme of psychology is not only measurement but also trying to represent what the individual does. That’s very complex and you have to approach it from many different angles. So you tell me how much of an outlier I am. I think there are some who definitely agree with you that it’s time for the Psychometric Society to move away from the measurement idea. I think traditionally that has been the main topic. I think some agree with you, some don’t; it’s an interesting matter. Wasn’t there a presidential address quite recently, by Klaas Sijtsma, on what’s overlooked in psychology, what psychometricians don’t pay attention to in psychology, or how they can become more relevant in psychology? Klaas is very much about the tight relationship between psychology and psychometrics. The two are often very separated, there’s “psycho” in psychometrics, but I think there are a lot of psychometricians not working on psychological theory. I think he believes the two should be more closely connected. I don’t know enough about what’s going on in psychology to say what’s lacking in psychometrics. Within person data is missing right? And time series analysis. Intensive longitudinal data is an example of that. Is the lack of within person analysis what you consider to be the biggest hurdle for psychometrics? That I don’t know, because I don’t know enough about psychology. 
I would be interested in hearing from psychologists what we’re not dealing with.
Should psychometrics actively encourage more cooperation with psychologists? In some way yes, but I don’t know what the effective way would be for that to happen. Since we wanted to move into time series analysis, we decided to collaborate with Ellen Hamaker, who is similar to us in terms of interests and knowledge about methods but also has extensive contacts with psychologists who have these kinds of data and want to do the analysis. She knows the kinds of questions they have, so we were guided by her know-how in terms of which features the new methods and the new program should have. You need to have somebody you can connect with. A bridging person: a person who bridges the statistical and the substantive worlds. Maybe that’s what’s lacking sometimes: the bridging person. I think that’s true. The most interesting participant of training sessions or workshops is a PI, a principal investigator. That is somebody with substantive interests in a certain field, who also has a methods interest, though not interested enough in methods so that he or she would want to do the analysis: but she can learn what kinds of questions she now has answers to. And that’s the kind of key people who have the energy to listen to all these new methodology stories. If you’re in the field of psychology, it could all very well be overwhelming to listen to all the statistical stuff that’s going on and you don’t know what’s relevant for your studies, but those PIs try to figure that out, so they’re very important. I met many of those throughout my years. There are maybe not enough of them. Or we should make better connections with them. There are often principal investigators, on let’s say, a psychology project, but when they collect data, they realize they need some topnotch statistical advice. You’re retired from university but you’re still working on Mplus, you have a team around you. What are the things you still want to do? I want to keep innovating. 
In the Mplus team, I feel that we (it sounds immodest) want to be game changers. I want to keep looking toward new areas to break into, areas where we feel that there’s a need from a data analysis point of view, because the statistics are too complicated to be practically useful in terms of analysis. And I feel we’ve done that in a couple of cases already, and I want to continue. We brought in Bayesian analysis, which statisticians use all the time but very few behavioral scientists used until a few years ago. Mplus made it easy to do that, and we’re doing the same thing now with time series analysis. Time series is really hard to do for psychologists; before, you had to use Bayesian analysis in programs like BUGS, these general-purpose statistical programs that require a lot of statistical knowledge, but now we provide a simple interface to do the analysis. And thereby we stimulate good substantive research. Researchers don’t need to take so much time for the techniques but can focus on how to best model and how to interpret and use these techniques, so I can imagine doing that until…
Is psychometrics lacking that urge to innovate, to do new things like Jöreskog did?
I don’t know if it’s a psychometric-specific thing. It could well be a statistician who has that urge. But I get the feeling that a statistician would probably be more interested in writing a statistical paper than breaking into a new area, but I don’t know if psychometricians are any more inclined to do that.
Is Mplus used beyond the academic field?
I definitely want to make it useful for whomever; all kinds of companies and institutions beyond academia should be able to use it. When it is used in academia, it’s fun because you get a lot of research interaction with users, but I’m always happy to see when non-universities use it. So that’s definitely an aim.
It must be a nice appreciation for your work that other people are doing it.
It feels fun, it feels good, and it is a reward.
Chapter 7
Paul Holland
“The problem comes from someplace else and I try to solve it. I’m a statistical consultant, that’s what I’ve done a lot of my life.” Paul Holland started working for the Educational Testing Service in 1975 and retired from ETS in 2006. Between 1993 and 2000, he was professor of education and statistics at UC Berkeley. Holland was president of the Psychometric Society in 1989. Holland finished his Ph.D. under the supervision of Patrick Suppes in 1966 at Stanford University. His research interests are discrete multivariate analysis, test equating, social networks, and causality, among others.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_7
How did you end up in psychometrics?
That’s a good question, and it has an overly long answer. So, I’m a statistician; I regard myself as a statistician who has worked in psychometrics. As an undergraduate, I was a mathematics major at the University of Michigan, and toward the end of my junior year, I knew that I was going to go to graduate school. I remember talking to one of the faculty members whom I really liked; I took a lot of his classes. His name was James G. Wendell and he was a probabilist, a mathematician interested in probability theory and its applications. I told him that I really wanted to use mathematics and that I wanted to go to graduate school, but I really didn’t want to become a theoretical mathematician. I was more interested in applications, in particular, applications to the social sciences. From when I was about 16, I had been involved in sociology and survey research. I had a long-standing interest in the social sciences, and as a major in mathematics, I was able to take many courses in sociology, anthropology, and psychology, though I took more in sociology and anthropology. So, Professor Wendell told me, “That should be easy, you should just become a statistician and go to Stanford,” and that’s what I did! He also mentioned some other schools, including Michigan because it had a very good statistics department, so I had a good safety school in case I could not get into any place else! I applied to several statistics departments, was accepted to all, but ended up going to Stanford. When I got to Stanford in the summer of 1966, right off the bat I met Professor Patrick Suppes. He was the first person I had ever met who had so many interests! He was on the faculty of statistics, psychology, philosophy (of course), the school of education and possibly others. And he taught a course on mathematical methods in the social sciences that was listed in the statistics department, and that’s how I got connected to him. My Ph.D.
dissertation was typical of the rest of my life. I got a problem from the people I was around, who were the mathematical psychologists that were at Stanford, and I tried to solve that problem. There was a group of psychology faculty (William Estes, Gordon Bower, and Richard Atkinson), and they were all working on various aspects of mathematical psychology at this center for Mathematical Methods in the Social Sciences in Ventura Hall at Stanford. Suppes was a part of this center and it is where he had his main office. So, one day, Atkinson asked me about some statistical calculations that he and his group were doing. He asked (as we met in the hallway), “Do you know how this technique we are using works? What are the degrees of freedom?” And so I went off and tried to figure out how their method worked, discovering that I could find, at best, an upper and lower bound on the “degrees of freedom” that Atkinson had asked about. I remember writing the material up so I could tell the results to Atkinson, and, offhandedly, handing it to Suppes so he’d know what I was doing. After looking at it, Pat said, “This looks like a thesis to me!” And I was flabbergasted! I thought it was just a problem I was solving, but we jazzed it up and it became my dissertation.
What was the dissertation about?
It was a way to estimate parameters for these complicated models that they were fitting to equally complicated learning data. Often these were data from what they called “concept learning” experiments.¹ These researchers had statistical models that attempted to predict the distributions of all kinds of aspects of these trials, over a single or, possibly, multiple subjects. Then they would try to estimate the parameters for these modeled distributions of the sequences of responses. Atkinson and his colleagues would do this estimation by a procedure that they called a “minimum Chi-square method,” which was modeled after a very old method from the statistical literature. They minimized a function that was a sum of several Chi-square-like quantities. They couldn’t actually differentiate the sum of Chi-square functions, set it equal to zero, and then solve the resulting equations. They just tried out various possible parameter estimates and attempted to improve them by minimizing, on a computer, these “Chi-square-like functions,” until they couldn’t get the value any smaller. My result was that it wasn’t an ordinary kind of Chi-square; it was actually a sum of different Chi-squares from different aspects of the same set of data, and they were all correlated. So, I worked out how to describe the distribution of the resulting minimized sum of Chi-squares. I gave some advice to potential users. It was complicated, but I learned a lot from trying to help Atkinson and his students.
¹ Paul Holland explains concept learning as “experiments involving showing subjects a set of items that varied in a variety of dimensions (shapes, number of figures, color of figures, etc.), and the researchers would get trials of data from individual subjects in which the subject would predict that the item shown was ‘in the concept’ or not, and then be reinforced with ‘correct’ or ‘incorrect,’ and then proceed to the next trial. The procedure stopped when the subject made enough correct responses that the experimenter believed that he or she had ‘learned’ the concept.”
My first job after my Ph.D. was at Harvard in 1966, in the Statistics Department, and I taught there and did statistical research and got involved in two big projects that had long-term consequences for me. I didn’t get tenure at Harvard, so after my term was over in 1972, I started working at the National Bureau of Economic Research, which was just down the street, in Technology Square, near MIT. That’s where I met numerical analysts who had quite some influence on my work. Robust statistical methods were all very hot at that time, and I did some work on robust regression. And then (this sometimes happens to people) the environment there became very unpleasant; they started firing people whom I thought well of. They didn’t fire me, but it wasn’t a nice place to work anymore. So, I started looking around, and a friend of mine, Donald Rubin, a former Harvard graduate student whom I had known when I was teaching, was working at the Educational Testing Service at the time, and he said, “Why don’t you come down here, to Princeton?” And so I did! I traveled back and forth from Hingham to Princeton but eventually moved there with my family of four, a dog, and a cat. I started this long-term collaboration at ETS with all sorts of researchers in the field of testing and assessment. I was probably, altogether, at ETS for 30 years or more. There was a break in the middle when I went to Berkeley for a few years, but
my main work was done at ETS. And fairly early on, I started working on test equating, and that’s how I got into psychometrics. Test equating is a weird kind of subject, and it’s something that I’ve worked on quite a bit.
Can you explain a little what that’s about?
So, why do testing companies, like ETS and ACT, need to “equate” tests? The tests of interest were what I call “tests that matter”: the GRE, the SAT, the College Board tests, the GMAT, the LSAT, all of these “ATs.” They were all housed at ETS at that time. For security reasons, these tests had a new test form produced for every test administration. Remember, these are all tests that matter and so test security is a serious enterprise for them. Occasionally, they would reuse a test form, but that was very rare, and only under special circumstances where security could be assured. And the test producers worked very hard to make the different test forms equal in difficulty, content, and other matters, but in fact, empirically one could see that they were never really equally difficult. So “test equating” is the statistical adjustment you need to make to the scores, after the test is given but before the scores are released, to take account of the empirical differences in test difficulty and other factors. You want a score of 650 on the SAT scale to mean the same thing, test after test, year after year. They give the SAT seven times a year, so there would be seven different forms for the SAT each year. It’s a very complicated task to stabilize those scores on the different test forms. I’m sure that, early on, ETS, the College Board, the GRE, and the other test users found out the hard way that they couldn’t do it only using human judgment. All serious test producers were, eventually, forced to use statistical methods to remove any remaining differences in test form characteristics when they reported final scores to examinees.
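As an editorial illustration of the kind of adjustment described in the answer above, here is a minimal sketch of linear equating, one of the simplest classical equating methods. It is not the procedure ETS used operationally, which the interview does not specify. It assumes that two comparable groups of examinees took form X and form Y, and the function name `linear_equate` and all score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Map a raw score x on form X onto the scale of form Y.

    Linear equating matches the mean and standard deviation of the two
    score distributions: y = mu_y + (sd_y / sd_x) * (x - mu_x).
    Assumes the two groups of examinees are comparable in ability.
    """
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    if sd_x == 0:
        raise ValueError("form X scores have no spread; cannot equate")
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical raw scores: form X turned out harder (lower mean) than form Y.
form_x = [10, 12, 14, 16, 18]
form_y = [12, 14, 16, 18, 20]

# A raw 15 on the harder form X corresponds to a higher score on form Y.
equated_same_spread = linear_equate(15, form_x, form_y)

# A second hypothetical form with twice the spread of form X.
form_z = [20, 24, 28, 32, 36]
equated_wider_spread = linear_equate(15, form_x, form_z)
```

The design choice here is deliberately minimal: only the first two moments are matched. Methods such as equipercentile or kernel equating, mentioned later in the interview, match the whole score distribution instead.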
So that’s how I got involved in test equating, and once you’re in there, you’re rubbing shoulders with people who do psychometrics every day. And, at some point I started going to AERA and the Psychometric Society meetings and became involved in various aspects of psychometrics. But, you see, it was all very circuitous, and I have never regarded myself as only a psychometrician. I’ve always had statistical interests and at some point we’ll talk about my other interests and where I think I have made contributions, and they’re not at all just in psychometrics.
We’ll definitely come to that. But I’m also interested in your advisor, Patrick Suppes. What did he teach you?
He and I got along really well; I don’t know exactly why. I did some things he thought were useful, but I’m not really like him at all. He had incredibly wide interests; mine were often deep but narrow. The only course I ever took from him was that course in my first quarter at Stanford on mathematical methods in the social sciences. He talked about all kinds of different things that he had done, and that other people had done, and I thought it was quite fascinating. After class one day,
early on, I told him that this was exactly the marriage of mathematics and behavioral science issues and questions that I found interesting and that I wanted to find out about. And I asked, holding my breath, “How do I become a student of yours?” And he just said, “Ok, you are! Come over to Sequoia Hall and we’ll find you an office!” Pat was probably the smartest person that I’ve ever known at the level of seeing him regularly. In the 1990s, I went to three retirement parties of his! He retired three times, at least! I don’t know how one is going to have three retirement parties! Pat was a big deal to many people.
And was there something you’ve learned from him, something that you think was specific to Suppes?
That’s a good question. Right now, I can’t put my finger on any subject matter. There were certainly the practical things like, “unless you write it, it doesn’t exist.” If you have an idea, you should drive it to completion, write it down, and publish it. I certainly wasn’t as good at that as he was. Pat could do that in five different subject areas and in probably more than one language! I can’t really say more than that. It wasn’t a single thing; it was more of an attitude.
He was primarily a philosopher, right?
Well, I wouldn’t say that. He certainly was an important American philosopher, and I remember that, at one point, he became interested in causal inference. He wrote a book on probabilistic causality.² I remember, later in my career, discussing it with him, because I thought his way of looking at causality was not all that helpful; I didn’t think it was the right way of thinking about it, but he always had some answer for any of my objections. When I saw him in action, I saw him as mainly a learning theorist. At Stanford, in the summer, the institute invited all kinds of people in mathematical psychology: Duncan Luce, Robert Bush, even John Tukey! All these guys would come through the institute, spend a few days or weeks, and give talks.
Summer at Stanford was very exciting! Suppes would also give talks. That’s what I would see of him, definitely not just his philosophy.
² Suppes, P. C. (1970). A probabilistic theory of causation. Amsterdam: North Holland.
³ Luce, R. D., Bush, R. R., & Galanter, E. (Eds.). (1963). Handbook of mathematical psychology: I. Oxford, England: John Wiley.
So it was more the measurement theory you saw?
Well, yes, Luce, Bush, and Galanter were writing their three-volume Handbook of Mathematical Psychology³ that had chapters on measurement theory. Okay, I guess I did learn one matter of substance from Suppes, now that you mention measurement theory. He had this notion of axiomatizing measurement things, and that’s certainly a philosopher’s perspective, or a logician’s. He was a logician, among other things, and he had this notion of having a set of axioms, and
a representation theorem for them that would “represent” the measurement in numbers of some sort. A representation theorem was a formal way to produce numbers or something like an ordering from measurements. These two things, together, in his view, were necessary to have a formal theory about any kind of measurement. I don’t know if I’ve ever seen such a theory in psychometrics, at all. But I certainly think that it is not a bad goal to aspire to. I’ve seen him do lots of different versions of these axioms paired with a representation theorem, for very weak forms of measurement, like semi-orders; he had axioms and representation theorems for all those things. I learned that and I learned to appreciate why you might do that, but I never did that sort of work.
Was there a reason for you to say, ok, I will do something completely different?
I guess I’m more practical. Remember, in my dissertation, I was trying to solve someone else’s problem. And, that’s sort of how I’ve done a lot of things. The problem comes from someplace else and I try to solve it. I’m a statistical consultant and that’s what I’ve done a lot of my life. Problems come from some real area, science or elsewhere, and I then look to see whether I can make a contribution. The contribution is often formal, but the problem is real. I subscribe to a certain perspective on the field of statistics. You get ideas from real problems that someone else, with real substantive interests, has. I have never understood people who work on only one problem area all their lives. There are some mathematical statisticians who work in areas very near to their dissertation area for the rest of their lives. I’ve never understood why anyone would do that.
You’ve done research on quite a rich variety of topics. Which are the most important ones for you?
I made an attempt to write these things down for this interview. This will be roughly in the order in which they occurred.
So, the first thing was a big project when I was at Harvard that resulted in a book on discrete multivariate analysis.⁴ I co-wrote this book with Yvonne Bishop, Stephen Fienberg, and Frederick Mosteller. Mosteller was the professor with the funding trying to make it all happen, and he, mostly, oversaw our work. There were other people involved, but Steve and Yvonne and I were the drivers of this project. At that time, there were plenty of books on continuous multivariate analysis and normal distributions. But still to this day, I have very rarely seen continuous data in any real application, much less multivariate continuous data. Even in the physical sciences, there is always granularity at some point. And in the social sciences as well, which is where I saw a lot of applications, there were zero-one and other types of categories; everything was categorical, or discrete: sex, race, education, opinions, etc. So we worked on this book, and it was well received. It was published in 1974, a long time ago. I’m going to give you a copy just so you have it, because I have a few left!
⁴ Bishop, Y. M., Fienberg, S. E., & Holland, P. W. (2007). Discrete multivariate analysis: Theory and practice. Springer Science & Business Media.
That’s nice!
So, the book on discrete multivariate analysis was the first project we worked on. Almost at the same time, I was asked to be a discussant for some papers in sociology. Fred Mosteller was in the American Sociological Association and they were having an annual meeting in Boston, around 1970. He had been asked to be a discussant of three technical papers, by very significant people at that time (Darrel Bock, Nathan Keyfitz, and James Davis), and he asked me, as a junior faculty member, if I could do it instead. So, I looked at these three papers, and I thought that the one by Keyfitz was very interesting, but the paper by Bock sort of answered the questions raised there. So in my discussion, I suggested that Bock should be Keyfitz’s discussant, and Bock obliged, probably annoyed that I did that. The paper I did want to talk about was the paper by Davis and his student, Sam Leinhardt, on social networks.⁵ In preparing my discussion, I had figured out how to make some of the calculations that they hadn’t included in the paper. So again, they have a problem; I see if I can make a contribution to it and I try to do that. My discussion of their paper started a long collaboration with Sam Leinhardt. At that time, he was a University of Chicago graduate student with Jim Davis in sociology. Sociology had been an interest of mine, so now I had an area of application that I could put my statistical knowledge to and work on. And we did!
For 10 years, we wrote ten papers; the first was based on counting triad types in a network,⁶ and the last of our papers introduced the first parametric statistical model for generating digraphs.⁷ We also published a book,⁸ a conference proceeding that included contributions from all the then-current stars in the field of mathematical social networks.
⁵ Davis, J. A., & Leinhardt, S. (1967). The structure of positive interpersonal relations in small groups. In M. Berger, J. Zelditch, & B. Anderson (Eds.), Sociological theories in progress, 2, pp. 218–251.
⁶ Holland, P. W., & Leinhardt, S. (1970). A method for detecting structure in sociometric data. American Journal of Sociology, 70, 492–513.
⁷ Holland, P. W., & Leinhardt, S. (1981). An exponential family of distributions for directed graphs (with discussion). Journal of the American Statistical Association, 76, 33–65.
⁸ Holland, P. W., & Leinhardt, S. (Eds.). (1979). Perspectives on social network research. New York, NY: Academic.
⁹ Holland, P. W., & Rubin, D. B. (Eds.). (1982). Test equating. New York: Academic Press.
So we’re talking about networks here?
Yes, social networks. So, that was a big deal in my career because it satisfied my desire to do something “sociological.” The next subject that I worked on was test equating. That resulted in two books. The first book was, again, a conference proceeding⁹ that we had convened at ETS. We got people to come from all over giving
talks and we put out the proceedings. And then later on, after I came back from Berkeley (this is much later in my career), I worked with other people on a special method for equating tests, called kernel equating, and that resulted in another book.¹⁰
Then came differential item functioning, which was a big topic; its former name was “item bias.” It has to do with the problem that some test questions may be easier to answer correctly for test takers of particular races or ethnicities. The reason I got interested in this was because ETS was involved in a lawsuit and I was an expert witness on ETS’s side. The suit was about the Illinois real estate licensing exam, that was the actual test, and the company that was suing was called the Golden Rule Real Estate Company. They were suing the state of Illinois for this test that ETS made and administered, so ETS got very involved in the lawsuit. The test was being sued for racial bias. I was asked to help with the defense, and the discussions went on for a long time, for months. These discussions were with “the other side’s” lawyers and their experts, trying (and I am being generous, here) to come to some reasonable understanding about what would be a plausible measure of bias in test items for ethnic groups and other things. And in the middle of that, I said to myself, “ETS ought to have a REAL method for measuring item bias.” What they were putting together was an inadequate method, with a lot of compromising, and it had to be simple-minded “so non-technical people can understand it.” And, suddenly, I found myself trying to do something sensible in the area of “item bias.” I started doing some initial calculations and evaluations, which reflected the well-developed categorical multivariate data analysis that I had written a book on years earlier.
We, my colleague Dorothy Thayer and I, ended up choosing a method called the “Mantel-Haenszel adjusted odds ratio.”¹¹ This approach, the MH, had been widely used in biostatistics, so it was not a new method, but one with an established track record. I was trying to work out the details so that Program Directors for the test would not resist doing it by saying, as they almost always did for any new ideas for their tests, that “it would cost too much.” I really wanted the method we proposed to be used. We could have tried to use the by then fairly well-developed item response theory (IRT) methods for item bias, but at that time, it was a big expensive deal to use IRT in actual test programs. This was the 1980s, so IRT wasn’t really something you could do easily. Had it been now, I might have used IRT methods, but at that time IRT would not pass the Program Director’s cost test. So we pursued the MH method; it was fairly easy to do, quick and cheap on a computer. It had everything I wanted as a statistician: there was a parameter to estimate, a test that went with it, standard errors; all the pieces that you need in order to make sound judgments about test items. That’s how I got involved in that activity.
¹⁰ Von Davier, A. A., Holland, P. W., & Thayer, D. T. (2003). The kernel method of test equating. New York: Springer.
¹¹ Holland, P. W., & Thayer, D. T. (1986). Differential item performance and the Mantel-Haenszel procedure. Paper presented at the 67th Annual Meeting of the American Educational Research Association, San Francisco, April 16–20, 1986.
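As an editorial illustration of the Mantel-Haenszel approach mentioned above, the following minimal sketch computes the MH common odds ratio for a single test item. Examinees are first matched on total score; within each score level k, a 2×2 table counts reference-group examinees who answered right (A_k) or wrong (B_k) and focal-group examinees who answered right (C_k) or wrong (D_k). The estimate is the sum of A_k·D_k/N_k divided by the sum of B_k·C_k/N_k, where N_k is the stratum total. The function name and all counts are hypothetical, and the rescaling to the “MH D-DIF” delta metric (−2.35 times the natural log of the odds ratio) is the convention commonly reported in the DIF literature rather than something stated in the interview.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score levels.

    strata: iterable of (a, b, c, d) counts per score level, where
      a = reference group, item right    b = reference group, item wrong
      c = focal group, item right        d = focal group, item wrong
    A value of 1.0 means no difference between groups after matching.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty score level carries no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined: no informative strata")
    return numerator / denominator


# Hypothetical counts for one item at three matched score levels.
# In every level the reference group's odds of success are 3x the focal group's.
biased_item = [(20, 20, 10, 30), (30, 10, 20, 20), (36, 4, 30, 10)]
alpha_mh = mantel_haenszel_odds_ratio(biased_item)

# Conventional delta-scale index; negative values indicate the item is
# relatively harder for the focal group.
mh_d_dif = -2.35 * math.log(alpha_mh)

# An item with identical odds in both groups at every score level.
fair_item = [(20, 20, 20, 20), (30, 10, 15, 5)]
alpha_fair = mantel_haenszel_odds_ratio(fair_item)
```

Only counts and a few sums are needed per item, which is consistent with the remark above that the method was quick and cheap to run on a computer. The accompanying significance test and standard error are omitted from this sketch.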
Howard Wainer and I eventually had a conference at ETS on DIF, differential item functioning, which is what “item bias” became known as. That resulted in a book¹² and that was a lot of work, but when we did that book, I said, “I’m not doing DIF anymore.” But what’s amazing to me is that DIF lives on! I met somebody yesterday at this conference who is doing DIF; it has actually become a subtopic within psychometrics!
And then, finally, because of my interactions with Don Rubin, I became interested in the statistical aspects of “causal inference.” I figured that I had my way of expressing what Don was saying, and by telling people about it, I could make his ideas more well-known. Don was much more a man of mathematical details than I was. I tried to give a global idea of what his ideas were about. We wrote several papers that have to do with applying these ideas to various problems. There was one on “Lord’s Paradox”¹³ where we worked out how his view of causal inference applied to Fred Lord’s paradox about non-randomized studies of causation. In another paper, we showed how you use Don’s ideas to estimate causal effects in retrospective, case-control studies.¹⁴ These are studies where you have sick people and similar non-sick people and you look back and see what they exposed themselves to and try to find out why they got sick. So it’s a backward study, it’s retrospective. You have cases of the disease and cases of the controls, and then you try to see how they are different. This is a very inexpensive way to do research in a medical field, but it’s backward from an experiment, where you start off with people who are the same and then you treat them differently and see what happens. But the analyses of case-control studies pretend that an experiment has happened, with some information not available.
So Don and I did a lot of things trying to apply these ideas to various types of study designs, always trying to explicate how they worked, how they didn’t work, from the point of view of what I started calling “Rubin’s model.” At some point in this work, I wrote an article for the Journal of the American Statistical Association, called “Statistics and Causal Inference.”¹⁵ I put a lot of effort into that because I thought it was important for these ideas to become better known. That paper has had a lot of press, a lot of people read it, and many have told me that it helped them understand causal inference better. So those are the main subjects that I regard myself as having worked on and (using Suppes’s admonishment) published on. I’ve worked on other subjects as well, but those are the major things I think of myself as having contributed to.
¹² Holland, P. W., & Wainer, H. (Eds.). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
¹³ Holland, P. W., & Rubin, D. B. (1983). On Lord’s paradox. In H. Wainer & S. Messick (Eds.), Principals of modern psychological measurement (pp. 3–25). Hillsdale, NJ: Lawrence Erlbaum.
¹⁴ Holland, P. W., & Rubin, D. B. (1988). Causal inference in retrospective studies. ETS Research Report Series, 203–231.
¹⁵ Holland, P. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–970.
Which is your favorite?
The equating work seems to have more legs than most everything else. I just couldn’t stand to do DIF anymore.
You were done.
I was done with it. Social networks I’m still interested in. I went to a conference a few years ago and found that some people are working hard at it! And, some are doing things that are related to the stuff that Sam and I worked on, years ago. I was surprised to see that networks were still a subject of active research, even though Sam and I had good reasons to work in the way we had. Some of the participants in this conference were from Facebook. They have these billion-node networks all computerized and they were doing some of the same types of calculations that Sam Leinhardt and I had done years earlier. They even knew a lot about what we had done. In the Facebook social networks, the graphs are generally not directed; they’re just undirected links. Years ago, I did the calculations for the expected frequencies for undirected quartets instead of the triads that Sam and I had focused on. At this conference, the guys from Facebook said, “We did that too because we thought that would be a natural extension of what you did.” I thought, my gosh, it has actually carried on to something new, and it has applications to real data. That was kind of nice! The work with Sam on social networks has had some very positive things for me; I don’t regret doing any of it.
You’ve mentioned sociology now more than psychology. I was wondering where this affiliation with sociology comes from; why does it appeal to you so much?
Well, now we are getting into my own history. My father was a sociologist. He died only a couple of years after he got his PhD, so it was rather a traumatic thing. We were living in Cuba at the time, while he was doing fieldwork, and I was about 12.
When I was 16, a couple who had been my father’s friends when they were graduate students together at MSU and who were now professionals in the field, Walter and Jean Boek, visited my mother, my sister, and me. They said, “We’d like to take Paul this summer to go to Winnipeg with us, and what we’re going to do is interview Indians and Métis” (that was the official Canadian word for people who are half Indian and half not Indian). Walt and Jean asked if I could come along and learn something about field research, which was what my father did while I was growing up and when he was getting his Ph.D. at Michigan State. I was a 16-year-old kid, and they thought I could help with the project! I would live with them in the house they had rented in Winnipeg. I had just learned how to drive so I could drive to do the interviews, and they’d show me how all the stuff of real field research works in sociology and anthropology. So, at 16, I was learning how actual survey research was done in the field, knocking on doors and filling out interview schedules.
The Indians and Métis living in Winnipeg were kind of a tough group to find. I had a list of addresses that the Boeks had obtained from official sources, and I would drive to the various areas and find the people and interview them. Most of the time it was pretty easy. I would knock on doors, find someone on the list and interview them. But I remember getting yelled at by this one guy. He thought I was from the health department (they were always scared that we were from the health department and that I might cause them to lose some kind of social service). But I tried to tell him I wasn’t, and at some point, I remember loudly saying “I don’t lie, it’s against my tenets.” He then calmed down and said, “I’m a religious man myself.” He then invited me back up the stairs he had just kicked me down and he gave me the full interview. That was kind of an astounding experience for a 16-year-old! But I learned that perseverance and honesty would actually pay off. Sometimes you’ll hear about “curb-stoning,” when the supposed interviewers sit on the curbstone and fill out the interviews themselves; they don’t actually interview anybody. I could have done that, I guess. But it never occurred to me! I never thought I would actually do survey research in my career, and I haven’t, but I’ve had this practical connection to it. And as an undergraduate, I was very fortunate. At the University of Michigan, there was a great anthropology department and a very good sociology department. Ted Blalock was a young assistant professor in sociology there, and he taught this course in sociology that we all attended. I liked it very much. So, that’s my whole connection with sociology before I met Sam Leinhardt.
Do you have a connection to psychology?
In my graduate school experience, there was a strong connection to mathematical psychology.
I shared office space with young graduate students such as Bill Batchelder, Jack Yellott, Dick Shiffrin, Bob Bjork, Steve Link, Joe Young, David Rumelhart, and many others. They’re old men and women now, some may even be dead. Dick Atkinson was one of the leaders of this center; he became the chancellor at UC San Diego, and he built up that branch of the UC system. They were all experimental psychologists, in the field of learning. This was all prior to the cognitive revolution, although several of my former math psych colleagues have told me that, back then (in the late 1960s), they were doing cognitive science long before it was called that.
Since “psycho” is in “psychometrics,” psychometrics is often considered to be related to psychology, perhaps more than other social sciences, but do you think it has a connection to other areas as well?
Large-scale testing is the place where psychometrics really starts and has had its biggest impact. That history comes out of Binet and that’s all psychology, right? But the application of psychometrics to questionnaires—I have always thought of a “test” as a questionnaire with right answers, so it’s different from an ordinary questionnaire. Some of the same principles apply to questionnaires. Psychometrics is being applied all over the place. Education is the place where I’ve seen a lot of it, because that’s the place where right answers matter. And people now worry about what students know and are able to do. There is normative behavior as opposed to non-normative, and we always want to move students toward the normative side (and that’s the right-answer part). That aspect of psychometrics is very widely applicable. It has very little to do with just psychology. Here at the IMPS meeting, you find a lot of psychometricians that work in education. There is a lot more money in educational things than there is in psychology by itself. In addition, the samples in education are much larger. In psychology, they’re lucky if they get a hundred cases. This may be not true anymore, with the Internet and all those kinds of things, but for ETS and ACT and all the other testing organizations, there are thousands and thousands of cases. The outstanding model program is NAEP, the National Assessment of Educational Progress. I don’t know if you know what that is, have you ever heard that name, NAEP?
No.
It’s a huge survey and it’s also a huge test. It’s an amazing mixture of survey research and advanced psychometrics. The idea is that there are about 300 questions you’d like to ask all the kids in America in, say, fourth grade math, but you can’t ask a kid to answer all 300 questions, so you ask them 30 or 40 questions instead. And you don’t ask every kid in America but just a very carefully selected sample of them, and a different sample of kids only every 3 or 4 years or so. But the trick is that no two kids in the sample get asked the same set of questions, and this way, we can actually have all 300 questions answered by a representative sample of America, whether it’s a sample of New Jersey, or of the entire country. This would’ve been impossible without psychometrics and survey research; they contribute beautifully together. There is a complicated sampling scheme, which comes from survey research, and then they have this complex sampling of the test questions.
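The matrix-sampling idea described above can be illustrated with a small sketch. It is a simplification under stated assumptions, not NAEP's actual procedure: operational designs use carefully balanced booklets and a stratified sample of students, whereas this sketch simply draws random item subsets. The function names `assign_booklets` and `item_coverage`, the sample of 2,000 students, and the seed are all hypothetical.

```python
import random

N_ITEMS = 300        # size of the full item pool, as in the interview
BOOKLET_SIZE = 30    # items each student actually answers
N_STUDENTS = 2000    # hypothetical sample size


def assign_booklets(n_students, n_items, booklet_size, seed=0):
    """Give each student a random subset of item indices.

    Assumption: purely random subsets stand in for a balanced booklet design.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_items), booklet_size))
            for _ in range(n_students)]


def item_coverage(booklets, n_items):
    """Count how many students were asked each item."""
    counts = [0] * n_items
    for booklet in booklets:
        for item in booklet:
            counts[item] += 1
    return counts


booklets = assign_booklets(N_STUDENTS, N_ITEMS, BOOKLET_SIZE)
counts = item_coverage(booklets, N_ITEMS)

# Share of the student-by-item table that is empty by design.
missing_fraction = 1 - sum(counts) / (N_STUDENTS * N_ITEMS)

print("every item asked at least once:", min(counts) > 0)
print("fraction of the table left empty:", round(missing_fraction, 2))
```

With 30 of 300 items per student, nine tenths of the student-by-item table is empty by design, yet every item is still answered by many students. That is the pattern of planned missing data that the psychometric model then has to bridge.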
And then they have the miracle of modern psychometric models from IRT to “glue” all this information together to say what the average kid would have said on all the questions on this 300-item test! It’s a really amazing mixture of modern science from two very different fields. That’s psychometrics done in a totally new way, but the basic calculations come from Lord and Novick’s book from 1968!
Would you say testing is psychometrics’ biggest contribution?
From a point of view of informing educational policy, what I’ve just described is a major contribution to getting the real facts about the educational enterprise in the USA. Policymakers may ignore the results, but they are there for them to use. Another example is PISA. It’s an international survey similar to NAEP; multiple languages are involved, asking the same questions to students in different countries. People worry about trying to ask the same question in French and German, it’s not so easy apparently, but putting all this on a common scale, that is an enormous undertaking (and leap of faith). This marriage between survey sampling and testing through a psychometric model and the procedures used to estimate these population totals for many test questions are really very modern. The EM algorithm was invented in the 1980s and has been used in all kinds of different ways to deal with missing data. When you think about it, in this kind of survey, you have a dataset with all 300 questions across the top and all the thousands of people going vertically. You have a block of data here, and there, but there is a lot of empty space, so there is lots of missing data. With the EM algorithm, you can actually solve that problem and estimate everything. You can’t say very much about an individual kid, but you can say a lot about kids living in a certain area.
Do you think that developing methods to do this kind of research is one of the main tasks of psychometrics?
I see psychometrics as a kind of statistics, with a slant in a certain direction, heavily involved with models, and conscious of measurement error and variability. That’s statistics to me, and they’ve made a huge contribution: you couldn’t do things in the real world without having that marriage of those ideas together. This type of testing was inconceivable 40 years ago. Now, you have tests on the Internet, and the idea is that you can actually try to get a national average even though a very biased sample took the test. It really takes a lot of thought to do that. But as I’ve said, I think that psychometrics has a very strong statistical side, I keep thinking of psychometrics as being part of statistics, the “metrics,” and not so much the “psycho.” Though the guys that invented the field all came from psychology.
Yes, they did, but you’d say that psychometrics developed into a broader field?
Well, it has. Psychometrics is not the same as econometrics. That is mostly about models of an economy—using large economic models with country-level data.
And it’s probably not the same as sociometrics, which usually has to do with social networks, at least in the olden days. But, I was just talking to one of my friends, and she is thinking about using social networks in a very different way than I have ever thought about. Her idea is to use social network ideas in data analysis, having little to do with sociology! The notion that some ideas have wide applicability, beyond their original use, is really important. And I think the Psychometric Society does a lot of that. They may be currently focused on one application, but there could be different applications as well! Psychometricians here are predominantly psychologists I would say. They have connections with psychology. Sociology doesn’t have strong connections to a mathematical group; they are kind of anti-math, at least they were when I was an undergrad.
I’ve heard that as well; they’re sometimes not very fond of psychology, since psychologists often want to do more technical and mathematical research.
Well, there’s this thing called social psychology, and that definitely has connections to both. After I go to these meetings, I’ll go to a meeting in August, to a gathering of a board of advisors for Marcia Linn, who does education research at UC Berkeley. Her thing is middle school science. It’s common, in that setting, that the kids work together, in groups or pairs, so there’s social psychology right there as well, it’s not only about education. I don’t know how Marx would have influenced any of this work. He certainly influenced the thinking of some sociologists. But I don’t think he has a lot of influence on the thinking of psychologists, or economists. Marx’s work has to do with social interactions of the social classes. I think the best things are when you try to think in an organized way about the data you have. I’m sort of babbling. Ask me a different question.
What paper or book from the history of psychometrics has really inspired you?
Probably the only book in psychometrics I really read was Lord and Novick.¹⁶ Also, I certainly read a lot of Angoff’s chapter on equating in the Educational Measurement review handbook.¹⁷ There is a series of chapters on equating; I certainly learned about things through that, so that’s another thing that has influenced me and surely has had an influence on several people at this meeting. These guys don’t do just one thing; they do a variety of things. I remember when Lord and Novick was first published, because Fred Mosteller was on the editorial board of their publisher, and he had an early copy of it. I remember Fred saying “this book is going to make a big difference,” and he was right.
What about that work made such a big difference?
Among other things, it is an introduction to item response theory in a very formal way: that’s what its big contribution was.
And they had programs you could actually use. What I haven’t said so far is that without having a computer doing the calculations you can just talk a bit, but you can’t, really, do anything serious. I was talking to one of my friends, and she’s saying that they are doing all this stuff in R. I don’t know anything about R but I know that’s a cool programming language to use for statistics. It’s a language you can do so many things in. And it’s an international collaboration, that’s just wonderful to me.
16 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
17 Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). Washington D.C.: American Council on Education.
You mentioned before this interview that you are also interested in the history of psychometrics. What do you believe is the story that should be told? Why do you think it is important?
Well, now I want to get out my little book. So, this copy is for you. It’s typical when you’re old and decrepit that they’ll have a conference in your honor, and so I had a conference in my honor. My former colleagues at ETS, Neil Dorans and Sandip Sinharay, organized it, and the last chapter of this book summarizes a view of the history of psychometrics that I am fond of.
What is this book exactly?
The book is called Looking Back, a book that “looks back” over my life.¹⁸
That’s quite an honor.
I thought so yes, it was very nice of them. In the last chapter, Neil Dorans summarizes a talk I used to give. I couldn’t find any versions of this talk before meeting with you. I have reduced most of my old work to nothing, now. I destroyed everything that’s not online; I don’t have any handouts or anything like that anymore. I once had this huge library which I gave to ETS, all kinds of publications that are now gone away. But in the last chapter of this book that I’m giving to you, Neil Dorans summarizes my ideas about the four generations of the history of psychometrics. Anyway, psychometrics has a 120-year-old history, and I divided it into four overlapping generations. The first generation was influenced by concepts such as error of measurement and correlation, which were developed in other fields. It focused on test scores and led to the development of reliability, classical test theory, generalizability, and validity. This generation began in the early twentieth century, around 1900. It’s very interesting that there was Karl Pearson, a statistician, and Spearman, the guy who introduced the notion of reliability.
Are we also talking about Francis Galton here?
No, this is a little later than Galton.
Pearson introduced the notion and calculation of the correlation between two variables. At the time you could get a Ph.D. by computing a correlation! Back in the day! And people did, and they found that there was little correlation between test scores and various things that they thought might be related to test scores, and that was kind of news to a lot of people. So, they’re finally collecting data and looking at it in an organized way.
18 Dorans, N. J., & Sinharay, S. (Eds.). (2011). Looking back. Proceedings of a conference in honor of Paul W. Holland. Springer.
Pearson and Spearman had a strong disagreement. Spearman had introduced the notion that for some types of measurements, it was important to take account of the unreliability of the measures being correlated in forming a correlation coefficient. This resulted in his “disattenuation” of correlation coefficients for the unreliability of the measures being correlated—measurement error decreases or “attenuates” the correlation between two variables. Pearson was horrified—a variable is a variable, with or without error of measurement. He really disliked this idea and wrote scathingly about it. Spearman was just trying to get at what the correlation would have been had there not been any measurement error—the true score correlation, the correlation between the actual things that were not noisy. Pearson really hated this idea, so he wrote papers saying how terrible this guy Spearman was. So, there was this unfortunate division between a prominent statistician and one of the earliest psychometricians. I find that interesting, but sad.
The second generation of psychometrics starts in the 1940s and that’s when they began to worry about items as opposed to scores, and this ultimately leads to item response theory. The third generation starts in the 1970s, when they tried to bring many more sophisticated statistical ideas and computational methods to item level modeling, and this is when we start actually being able to do NAEP work. I think it was actually 1984 when ETS gets the NAEP contract. I think NAEP was a model for many other international assessments: giving the kid a doable task, not 300 items, but 30 items, but doing it in such a way that you can say something about the 300 items, and that’s pretty impressive. I’ll talk about the fourth generation: it attempts to bridge the gap between statisticians and psychometricians and the role of other components of the testing enterprise.
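Spearman's correction, described earlier in this answer, is usually written as follows. The notation is an assumption of this sketch, not something taken from the interview: $r_{XY}$ is the observed correlation between two measures, $r_{XX'}$ and $r_{YY'}$ are their reliabilities, and the numbers in the comments are purely illustrative.

```latex
\hat{r}_{T_X T_Y} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}
% Illustrative values: r_{XY} = 0.36, r_{XX'} = 0.64, r_{YY'} = 0.81
% \hat{r}_{T_X T_Y} = 0.36 / \sqrt{0.64 \times 0.81} = 0.36 / 0.72 = 0.50
```

Because reliabilities are at most 1, the corrected value is never smaller in magnitude than the observed one, which is the sense in which measurement error "attenuates" a correlation.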
Testing occurs in a larger context and measurements need to occur within this larger context. One of the things Dorans likes is my distinction between test viewed as blood test versus test viewed as contest. I’ve made this distinction a lot. I think Dorans has a quote in here. Galton wrote this:
“One of the most important objects of measurement… is to obtain a general knowledge of the capacities” – this is very sexist – “of a man, by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man’s power.”
So, this is the notion of what I call the “blood test” view of testing. It’s a test where you measure something of the person—that’s the “sinking of the shafts” idea. This is an underlying basic image that is made throughout psychometrics—obviously from the very start of the field. But, all of the tests that I was involved in at ETS were competitions: the participants were competing for a scholarship, or to get into a college, etc. These are tests where there are winners and losers, and that’s very different from ascertaining some characteristic of a human being. This is the “contest” view of testing. These two views of testing can be in conflict. Certainly all of the concerns about cheating on a test or the fairness of tests have to do with the “contest” view, and they have nothing to do with the “blood test” view. At one point, I had thought about this distinction enough that I had examples where the blood test view was the most important and other examples where the contest view was the most important. You can think of your own examples, but they definitely are two different ways of thinking about tests and they’re not the same.
Getting back to the four generations of psychometrics. There was the first generation that comes out of Spearman, with Kuder, Richardson, as well as Tucker, Lord, Rasch, and many others. Then, in the second generation, there is a focus on modeling test items and the invention of IRT. In the third generation, there is a serious focus on modern computer technology, for example, Bock and Lieberman, and how to really do the computations for IRT. That’s the third generation. It is still very early days in the fourth generation with a focus on the much larger structure of the whole testing enterprise. We have ideas such as ECD, evidence-centered design. The whole thing about assessment design is that it tries to take into account all the factors that are going into making, administering, scoring, interpreting, and using a test. I was once on a committee for the National Research Council; it was called the Board of Testing and Assessment (BOTA). It concerned itself with all aspects of testing and assessment. And I remember deciding to make myself a graph of how I thought all BOTA members should think about a test. My point of departure was “this is not just a single test, but a testing program, it doesn’t exist once, it continues in time.” All these programs, like the SAT, NAEP, PISA, don’t just happen once, they live for a long time, and because you’re interested in growth, you have to watch things over time. But this elaborate system has many players. The simple psychometric class model is “student meets test item,” but this is like the tiniest molecule in the testing enterprise. But where does the kid come from, what’s happening to the kid, how does the test question get there? What is the educational function here, what are the political consequences? Once you start thinking about these things, it’s a very elaborate system.
In my view, the people on BOTA at that time all had this very simple-minded view, and I thought that they ought to have a much bigger view; some of them would see one corner of this bigger picture and another would see another corner of the picture. I thought they should have the full picture. I don’t think I would’ve been able to do that without thinking of psychometrics. This book is for you, you can find out some things from it.
I’ll look through it, especially the history part! One final question, and you’ve talked about this a little bit already, but what do you think is the biggest task of psychometrics for the future?
The big task of psychometrics? I don’t have a simple answer to that. I think of it as bringing this basic concept of measurement with uncertainty associated with it, its quantification and all the elaboration that’s going on to ever more complicated kinds of human responses. I think, generally, there’s always going to be these human responses, and they will be measured by things like tests. I don’t know if I’m ever going to do anything with a robot; it seems pretty far out for me, but you never know. Maybe we’ll work with human-robot interactions! That might be something, especially if they’re trying to make robots more and more human. The future of psychometrics is about the open-mindedness of all the different varieties of the ways that people collect data and try to draw conclusions and to make sense of it. But this underlying theme, which I think goes back to Spearman, this notion there’s something there, you try to measure it, but you measure it kind of poorly, or with uncertainty, that stays. In physics, they do that too, but they hide that fact, they don’t tell you as much. I remember. I have a dear old friend who’s a particle physicist, and in his dissertation he showed me this graph from his research. There were these bumps in the graph, and they were doing what they call “bump hunting.” They would fit various curves to the data underlying this graph, and what they did was to add another possible bump into the equation and estimate it. The bumps mean something. That little bump, it means that the experiment measured a particle of some kind. I looked at that graph and said, “this looks just like social science data to me because that bump might not actually be there, it’s pretty noisy.” It turned out his bump didn’t really exist, and how did they find out? Replication. And that’s the thing you don’t always have in social sciences, proper replication. In physics they can repeat the experiment, and they can do that with more precise measurement and things like that.
Thank you for this interview, I learned a lot!
You’re welcome! I talk a lot!
Chapter 8
Robert Mislevy
“All the best stuff that I’ve ever done has come out of applied problems.”
Robert Mislevy is professor emeritus of Measurement, Statistics, and Education at the University of Maryland and was president of the Psychometric Society in 1993. He also holds the Frederic M. Lord chair at ETS. Mislevy finished his dissertation under R. Darrell Bock’s supervision at the University of Chicago in 1981. His research interests include Bayesian inference networks and evidence-centered design.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_8
I always start with the question: how did you end up in psychometrics?
I think many people would answer the same: it was a coincidence. I had university training in mathematics and statistics. I got a master’s degree and I needed to get a job, so I went on interviews to a lot of different places: a commercial bank, a real-estate company, a management company, a telephone company teaching regression, computer programming at Argonne labs in a physics project. But the place where I thought I would most like to spend time hanging around every day during work was a small educational research organization in the Chicago suburbs. There were two researchers and they were hiring two assistants, me and another guy, and a couple of support people and that was it. So I took the position there, I had no particular plan on psychometrics or psychology or anything else, but I enjoyed the job and my bosses, Jerry Jenkins and Tom Kriewall, became my mentors. I was there for about 2 years, and they said, “we’re starting to do multivariate analysis, and we understand there’s this guy Darrell Bock at Chicago, who knows something about this,” which is of course a bit of an understatement, and they said, “why don’t you see if you can take some courses with him and learn more about the things that we’re doing?” So, I started to be a student at large, took a couple of courses, and the third course I took as a student was a course in psychometrics. It was jointly taught by Darrell Bock and Ben Wright, on alternate days, and that was fascinating: it was probably the most interesting thing I’d ever done, so I was hooked from then on.
What did they teach you?
Well, they both taught us about psychometrics from a latent variable point of view. Darrell’s more of a Thurstonian tradition, Ben more from a Georg Rasch position. They’re both really smart guys.
Some of the things they were telling you were really similar, whereas other things were very different, and the two of those together made us students think a lot harder than we would’ve had to, had we only been taught by one person. They had us do projects and work with each of them, talking about our projects and getting feedback; you could get different directions on how to think about the models and how they interplayed with the real things that were happening that led to the data. It was great and well, I’ve gotten lucky, I’m in a field that’s actually interesting!
So, before you came to that small education testing bureau, you had no interest in psychology or psychometrics?
No. I took a couple of courses in psychology, like almost everybody does in college, but had the people been nicer at the physics laboratory, I would probably still be programming at Argonne labs.
Many people would say that they ended up in psychometrics due to some coincidence, but this is a little extreme I would say; you had no early interest in psychology whatsoever.
Right. I studied mathematics and statistics and I had to get a job where somebody would pay you to be doing those things and there was quite a range of possible jobs. I actually think it turned out to be an advantage too, because coming from the point of view of statistics, you know off the bat that it’s all just about models, that the world is way more complicated than your models are. An important part of the job is figuring out what’s important to include in your models. If you’re in a situation where you can design what you’re going to observe, what data you’re going to get, you think about what models can tell you. What can you pick up that’s important? What would be nuisance? What would you pick up that’s not important that you might be able to design around? So compared to coming up in psychology, or in education, where the modeling and the substance are confounded, coming in where you’ve already got the modeling in your pocket gives you a better outsider look at what the assumptions about the substance are. And I think that makes it a little bit easier, as the field changes or with new things happening elsewhere, to hook these new developments into your thinking. It’s partly a psychological investment, but it’s partly the way that your schema has formed: the disentanglement of the modeling and the substance is much easier if that’s how you come to the field.
So you mentioned that Darrell Bock preferred the latent variable model ideas.
Yes, the Thurstonian tradition. That’s the building of response process models, which comes from psychophysics. He wrote a book about psychophysical methods with Lyle Jones¹ and was an expert on that.
So I often use this anecdote in my classroom when I talk about expertise: experts don’t necessarily have greater memories, or traits that you would measure on IQ tests, than anyone else, but they’re different because their knowledge is organized in useful ways. So for working memory, we can deal with up to seven chunks. One time, professor Bock was writing all these equations on the board and we were furiously trying to copy them down. There was this great big mass of figures and numbers on the board, and he turns around at us and he says “and of course these are just your Müller-Urban weights.” We started to chuckle because to us, those were 25 separate things, but to him it was one thing, and that was a great lesson to us. You come to see how things combine and recombine in chunks and after a while, after a long while, that’s how you start to see things too. It’s a great example of a very important principle.
1 Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco: Holden-Day.
And did you prefer the Thurstonian tradition or the Raschian tradition of Benjamin Wright?
Perhaps a little more the Thurstonian tradition. I really appreciate the Raschian tradition a lot. But in my job—I work mostly in applications—I use models other than Raschian models, which are necessary to do the job. I really appreciate the technical qualities of Raschian models, and if you want to be strongly aligned with principles in properties of fundamental measurement, it’s clearly the way that you go. Sometimes you can do that, but sometimes you can’t. If you’re working in applied work, it’s helpful to think what you would do under ideal circumstances, and then to say “well, I don’t have ideal circumstances,” sometimes because of the data themselves, sometimes because of constraints in the data collection, because of logistical constraints or resources. Being able to model means that you assemble these modules and make choices at every point that are going to give you the best inference. I’m very happy to use the Rasch framework; I really like being able to have these maps—whether they have a unidimensional or multidimensional structure—where the persons and the tasks are aligned on these common scales, and the implications for what you would observe are the same, no matter where you are on those scales. But if that’s not the situation, if the link functions are predicting expectations different from the observations, things are a lot messier, and more bound to the context. So, less robust in that sense. I don’t have the luxury of saying “ok, I’m just going to do a Rasch model.” Another thing I would say too: one of the things I really appreciated from Ben Wright’s part of the course, and also the Friday seminars that I continued to take after that: the applications that he and his students worked on were usually quite good applications.
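The remark that, under a Rasch model, the implications are the same wherever you are on the scale can be made concrete with a small sketch. It assumes the standard one-parameter logistic form of the model; the function names and the ability and difficulty values are hypothetical and chosen only for illustration.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(probability):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(probability / (1.0 - probability))


# Two hypothetical items and three hypothetical persons on one common scale.
easy_item, hard_item = -0.5, 0.5
abilities = [-1.0, 0.0, 2.0]

# For each person, compare the log-odds of success on the two items.
gaps = [
    log_odds(rasch_probability(a, easy_item))
    - log_odds(rasch_probability(a, hard_item))
    for a in abilities
]

for ability, gap in zip(abilities, gaps):
    print(f"ability {ability:+.1f}: log-odds gap between items = {gap:.3f}")
```

The gap in log-odds between the two items equals the difference in their difficulties for every person, whatever that person's ability. Models that give each item its own slope lose this property, which is one way to read the comment that they are messier and more bound to context.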
That may have partly been because they used the Rasch model, but more importantly, they used the Rasch model as a tool to think hard about the substance of what was happening in the real world and how it interplayed with the models. It was the thinking and the quality of their argumentation that really made the applications work well.
And you did your Ph.D. work under Bock’s supervision, right?
Yes, and Ben Wright was in the committee as well. They were focusing on different chapters of the dissertation. There was also John Bormuth, who was a reading specialist; he was one of the first people to come up with artificial-intelligence ways of creating tasks for reading, using linguistics techniques. He was a bit ahead of his time, in that computing wasn’t able to do the things he wanted, so it was actually carried out by his grad student Patrick Finn who was doing it all by hand! I don’t know whether Patrick would be happy or sad that things that took him two years to do can now be done in two minutes.
Your dissertation was about reading?
Yes, that was one of the areas of application. It was an interesting project, because Bormuth was ahead of his time in understanding how the structuring of written material carries a lot of meaning: it’s an interplay of lexicon, structure, semantics, and pragmatics. In his automated algorithms for generating test items, he leveraged the syntactic and the lower level semantics of a text: so you could give him a book, and using his algorithms, you could turn the crank and generate a universe of so-called w- and h-questions, like “who what when where why how.” At the time, I would say that was probably the most advanced way of thinking about reading assessment. Now, it wasn’t advanced in the sense of getting at the fine details of shades of meaning and so on; researchers like Art Graesser are still wrestling with that and probably will be for some time, but his work was a great conceptual advance. The technology wasn’t there yet, but people are doing now what he had the idea of 40 years ago.
What was your own research question?
Well, the title was “A general linear model for threshold parameters in the Rasch Model.”² Ben Wright was helping me think through the Rasch-related things a lot; the general linear modeling was from the multivariate analysis of variance work that Darrell knew very well, so he was helping me a lot with that kind of modeling. One of the applications that I was analyzing to model difficulties was reading tests, and John Bormuth was helping me with that.
Where did you go after finishing your dissertation?
I stayed with Professor Bock for 3 years at the University of Chicago. The National Opinion Research Center was affiliated with the university, and Darrell worked some of his large applied projects through this center, so I had the privilege of working with him on some large important projects in educational assessment, like the California Assessment Program and the Profile of American Youth Study. The latter used a test battery called the Armed Services Vocational Aptitude Battery, or ASVAB. They gave these tests to large samples, about 8000 young adults.
At the time, Darrell was developing his marginal likelihood methods for estimating item parameters in IRT models, and we would apply those methods to data that came from the Profile of American Youth study. It was during that time too that he and I together wrote the BILOG program,3 which ended up having a lot of use after that as well.
2 Mislevy, R. J. (1981). A general linear model for the analysis of Rasch item threshold estimates. (Doctoral Thesis.) Chicago: University of Chicago.
3 Mislevy, R. J., & Bock, R. D. (1981). BILOG – Maximum likelihood item analysis and test scoring: LOGISTIC model. Chicago: International Educational Services.
I stumbled upon this BILOG program a couple of times in preparation for this interview; would you consider that as one of your main achievements?
Probably so, and it was a great learning experience. It was a great achievement in the sense of doing something hard and working on it for a couple of years, and getting
it working, so I feel great about that. We’ll talk later about one of your questions about personal achievements, but BILOG would be, I think, one that falls under work that I’ve done that fits in the themes of marginal maximum likelihood and Bayesian inference. It’s a very tangible product that came out of these broader related issues I’ve thought about and worked with over the years.
Can you identify the three main research lines in your work?
To start off with that one: the topic of Bayesian inference is one of the major themes in my work, and that would include BILOG, but also estimating distributions of latent variables in populations, the incorporation of covariates about persons and items, and dealing with missing data in responses. The work that I did with Bayesian inference networks with Russell Almond4 is another part of that whole aspect as well. Cognitive diagnosis is one of the areas we applied it to. So, that’s one theme that goes all the way back to the first class that I took with Darrell. A second theme that is somewhat related to it also stretches back to those experiences, and that’s working with large-scale educational assessments of achievement. So I worked on those projects I briefly mentioned earlier together with Darrell, and after that I went to the Educational Testing Service (ETS). I worked on developing marginal estimation procedures for very efficient matrix sample data, in the National Assessment of Educational Progress. This was one of the largest projects that I worked on for a couple of years. It was an efficient enough design that computing point estimates for proficiencies of individual respondents no longer worked, so you had to use marginal methods. That’s when I and a team of other folks implemented the plausible values methodology5 that is still being used in a lot of other large-scale surveys today.
You worked for a long time at ETS, you’re still working there, and you’re also still working at a university, right?
Yes, I worked for ETS for a long time and I am back there again. For 10 years, I was a professor at the University of Maryland. I’m emeritus there now.
4 Almond, R. G., Mislevy, R. J., Steinberg, L., Yan, D., & Williamson, D. (2015). Bayesian networks in educational assessment. New York: Springer-Verlag.
5 Mislevy, R. J. (1991). Randomization-based inference about latent variables from complex samples. Psychometrika, 56, 177–196.
Is there a difference between the type of job you were doing at the university and at ETS?
I would say that if you had a list of categories of kinds of work, the categories are very similar in both places, but the weighting was different. At ETS I did less teaching. I taught some training sessions and seminars, but I wasn’t teaching a class regularly all the time. I sometimes worked with students at ETS too, with summer students and post docs, helping younger researchers, but not with a group of grad
students that I was responsible for at the university. I worked more on research projects and writing at ETS. At the university, I did that as much as I could, but there was simply less time. So for me, it wasn’t very different; for some other people there may be bigger differences. Adjusting the weights took a year or two to get under my belt, but it wasn’t that big of a difference to me.
ETS works with a lot of real data of course: is the work you did at ETS more applied than the work you did at a university?
Perhaps slightly so, but I have to say: all the best stuff that I’ve ever done has come out of applied problems. When I worked at the university, the most challenging intellectual work I did, other than helping on some students’ research (I hope at least some of that happened), was in two external collaborations. One was with SRI International with Geneva Haertel.6 We worked on applied projects and we developed a framework for model-based reasoning and other inquiry science assessment. I also worked with John Behrens and his team at Cisco, developing simulation-based assessment in learning computer network engineering.7 My graduate students worked together with me on those projects. There’s a book written by Donald Stokes, called Pasteur’s Quadrant,8 where he said that the traditional contrast between applied and theoretical research was wrong. He says that there are two dimensions: one representing whether or not you’re working on current important problems, and the other representing whether you’re using fairly well-known tools and ideas, or whether you’re pushing on the boundaries of tools and ideas. He called that book Pasteur’s Quadrant because Louis Pasteur’s work was at the same time exploring new ideas about the nature and transmission of disease and working out how to halt then-current smallpox epidemics. I would describe the work that I did with Cisco as exactly in that quadrant.
We were developing new things to get simulation-based training and assessment out on the Internet, to hundreds of thousands of students. So we were doing things on the fly and creating new design models, new ways of modeling student data in real time, sometimes a year, sometimes months before they were out in the field being tried by students. And the same is true at ETS too. One of the things I work on at ETS is game-based and simulation-based assessment, and those things also are very much cutting edge. Doing them well, and this gets at another question that you had, is necessarily interdisciplinary. Take the case of the Cisco Packet Tracer simulation environment:9 in order to make it work, we had cognitive psychologists, network engineers, instructors and curriculum designers, task designers, and psychometricians, including myself. At best, any one person was an expert in one area, maybe half in another area, but had to work at a high level with people with different expertise. And that work had to come together. You’re trying to make an artifact that has some function in the world, and it needed concepts from all of those contributing domains. That’s not only a lot of fun to do, I think that’s also where sometimes great progress happens, more so than in individual fields.
6 Mislevy, R. J., & Haertel, G. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25, 6–20.
7 Mislevy, R. J., Behrens, J. T., DiCerbo, K., & Levy, R. (2012). Design and discovery in educational assessment: Evidence centered design, psychometrics, and data mining. Journal of Educational Data Mining, 4, 11–48.
8 Stokes, D. E. (1997). Pasteur’s Quadrant: Basic science and technological innovation. Brookings Institution Press.
9 Frezzo, D. C., Behrens, J. T., Mislevy, R. J., West, P., & DiCerbo, K. E. (2009). Psychometric and evidentiary approaches to simulation assessment in Packet Tracer software. In ICNS 2009 Fifth International Conference on Networking and Services (pp. 555–560).
So how connected is psychometrics to other disciplines? Could you speak to those people who came from other fields? Speaking to the statisticians would probably be easiest, I reckon.
Yes. They don’t always understand the domain content of what you say, and they sometimes think you’re not being rigorous enough. But you’re working in the real world and the real world gets kind of messy sometimes. Knowing about statistics helps you do a better job of optimizing what you have to do. Coming back to my research lines: the third line of research entails trying to understand what assessment really is in the bigger picture. This line of research is called evidence-centered assessment design, or ECD, and started with Russell Almond and Linda Steinberg. ETS is arguably the best in the world at doing the kind of test that they became famous for, but there were more and more projects arriving that were different. For this Cisco on-the-fly assessment and another project, we couldn’t just take that machinery off the shelf. Linda was the project manager for a second project, which was called Hydrive. It was another coached practice system: learning to troubleshoot the hydraulics subsystem in the F15 aircraft. Drew Gitomer was the PI of that one. Coached practice systems are a type of assessment.
You’re trying to figure out what people are learning, how they’re thinking, where they’re having trouble, and you try to suggest what they have to do to learn better, whether it’s refreshing, going over what the components are that need to work together in the system, or reviewing space-splitting strategies. So you’re trying to do all these things on the fly, and that’s when we started using Bayesian inference networks: they’re well suited to modular inferencing. You have these link functions for situations, and you can build inferential models on the fly to suit the evolving situation. Bayesian inference fits perfectly: you have a more sustained set of variables, the variables that characterize proficiencies or knowledge or strategies that people seem to be using. You can update these variables and link on the components that go along with the situation that they work themselves into. So basically: make an observation, update, move on, and then offer advice if it looks like the student needs it. None of that was on the standard psychometric tool shelf, though there have been some similar projects. ETS is a great place to think about these things, because at any one time there might be a hundred or more different assessment projects going on of all different kinds. Evidence-centered design is about looking for underlying
themes in the reasoning, in the design, trying to understand the underlying structure, and that helps you rise above the particulars of any project. All these assessments that look very different on the surface are playing out in different forms, but with the same underlying evidentiary reasoning and statistical and design principles. Coming back to the question that motivated this: if you can have some structures that allow you to talk at a higher, more general level, and have some representatives from each field who can show, “well here’s our argument, here’s what we’re going to have to do in design,” it gives you a groundwork. Then you can talk more easily with subject matter experts, programmers who are going to implement this stuff, and systems architects, who don’t have to know what’s inside and encapsulated in the object but need to know what objects have to “talk” with each other about. Having that groundwork was a great impetus for evidence-centered design. And that’s my third, and probably most exciting to me, line of research.
I’m sorry to say this is the first time I’ve heard about evidence-centered design. Psychometrics is usually considered to involve IRT or SEM. Do you think that psychometrics should become a more inclusive field?
Absolutely, and I think that’s a real problem. I think what’s happening in the world are developments in technology that are able to capture vast amounts of data. One of my friends at ETS, Saad Khan, whose background is in computer vision, has done projects where people train to interact with people from other cultures. You don’t know the language very well at all, but it’s multimodal: you’re saying things, the avatars are saying things back, and their body language, your body language, and facial expressions are all being tracked. All these things are happening simultaneously in real time. When you look at the data, there’s like a gigabyte from a single interaction. And so, IRT isn’t the tool that you need.
What you need is some of the same basic evidentiary principles that you’re going to build your argument around, and you’re in need of much heavier-duty data analytics, but you’re ultimately trying to do the same basic things that you do with simple tasks and IRT models. So, there’s one current branch that I work in which comes out of psychometrics. I gave a talk last year in Edinburgh at the meeting of the Learning Analytics and Knowledge conference, about what data scientists need to know from psychometrics. Part of the talk was about values of probability-based reasoning, and how they interplay with data analytic techniques for dealing with vast amounts of data. Sam Messick had a real nice quote in one of his articles; he said that validity, reliability, comparability, fairness, and I toss in generalizability, aren’t just measurement issues but they’re social values, and they have meaning and force whenever you’re making decisions about people.10 People in data analytics need to know that for what they’re doing.
10 Messick, S. (2000). Consequences of test interpretation and use: The fusion of validity and values in psychological assessment. In Problems and solutions in human assessment (pp. 3–20). Boston: Springer.
An important branch of psychometrics has been trying to figure out exactly what that means: how you can use tools of modeling and mathematics and design to
embody those values. Lee Cronbach is my hero in doing that sort of thing. But how do you take those principles that are by now deeply embedded in the particular work that psychometricians do and practice, and say that those principles aren’t only connected to those kinds of tests and equations, but that they are pervasive and important for data analytics too? Even when you’re not using anything that psychometricians have ever heard of. Sometimes we can use the machinery as is, but sometimes we have to adapt it, reinterpret it, extend it, doing things that are analogous in principle and in concept, to deal with these important, pervasive and, I would claim, long-lasting values. The data change, the psychology changes, the statistics change, but these principles don’t. And I think lots more than just psychometricians ought to know about this.
In many people’s views, the psychometricians are there to help the psychologist, or at least work together with them. They apply psychometric methods to psychological data, testing data, but you make it sound like psychometrics is useful for all sorts of data. Or do you consider all those data types also some kind of psychological data?
Cronbach talked about “behavioral data.” I think psychometrics deals with what people do, what they say, or what they’re making, and the psychometrician tries to deal with inferences from observations that have to do with something that people think, are able to do, or what they need to work on next, which is, inherently, psychological. There are people who might not think it’s psychological, but the nature of the inferences is. And I totally agree with you that most people think psychometrics is just about working with psychologists, and, in fact, they even limit their thinking to certain kinds of data, questions, and equations.
I actually found it’s easier to get people to recognize the value and the use of psychometric techniques if you don’t call them psychometric techniques until you’ve worked with them for a while! They see them as ways of thinking and modeling techniques that help them do what they understand and what they want to do. If you then say, “by the way, these ideas and some of these equations come from psychometrics,” then it’s all, “oh, good, I’m glad psychometricians thought of that.” But if you tell people who come from completely different fields, say, game-based assessment, that there is some interesting psychometrics you’d like them to do, they say that they don’t need any psychometrics. “I’m sure that was very helpful to Charles Spearman 100 years ago, but I’m not doing what Charles Spearman does.” And they’re wrong, they’re making inferential errors, and they’re sometimes making decisions that are unjustified, invalid, unfair, and there are tools that can help them do better. Psychometricians have those tools, and not enough people know about it.
That sounds like psychometrics has a big future, according to you.
I would hope so, because it can go two ways. There’s a lot of very important knowledge that has been developed, but to a degree it is encapsulated in the kinds of data and problems that it evolved to solve. And being able to make explicit what those concepts are, to state them more generally, to cast a more general modeling view
that reconceives or extends or analogizes those, doesn’t happen by itself. Many people outside psychometrics haven’t been thinking about those concepts in a broader sense. They’re stuck on one path, they make bad decisions, they get sued, their artifacts don’t work, or something else bad happens, and they have to rediscover, in their own context, what psychometricians have spent the last 100 years figuring out. That’s the unhappy future I hope doesn’t happen. But the happy one is that talks by me, by Mark Wilson, by lots of other people who are interested in spreading this knowledge, and putting it to work in actual projects and artifacts that make it visible how these concepts are useful, are ways forward. One of the things that I’m pleased about is seeing just as many citations to the evidence-centered design work come from people who are not psychometricians. They’re doing game-based assessment, or simulation-based assessment, and they are not finding anything in their field that helps them with their design and inference. They’re also not finding useful things in the basic psychometric literature, even though the ideas actually would be useful. So, the work that Russell, Linda, and I did in evidence-centered design has been helpful as a bridge for the psychometric ideas to become helpful to people who need them.
Are there psychometricians who very much disagree with you?
I’ve had some interesting conversations with someone you probably know, Denny Borsboom.
Well yes, that’s my supervisor!
I thought you might know him. So his view on this is somewhat different than mine. I don’t know if I have the quote exactly right, but he talks about validity as whether this test is measuring what it’s supposed to be measuring. I think that’s a nice slogan, but I think there’s more to it than that. Philosophically, I think of latent variables from a constructivist realist position: there is no such thing as a “systems thinking theta” within individuals.
There are certainly schemas and activity patterns that people learn in the world and from their particular experiences, and they learn them because much of what we do when we interact with other people and objects is tacitly built around these schemas. So you learn to recognize them and act within them, but that’s somewhat different for everybody. There are enough similarities that if somebody is not doing well in these kinds of situations, we can characterize some of those major themes with psychometric models with a number of dimensions. However, it’s more complicated than that: it is also very useful to help students learn better, or show how they’re coming along, or compare programs. Do I think that the thetas in that model that I’m using exist as different amounts of the same thing in different individuals? No, I don’t believe that. I do think there is something real, but I think it’s more individualized than the thetas can say. But this is a great contribution of psychometrics: realizing that this is a very productive way of thinking in a world that is vastly more complex than we can fully appreciate in a model.
I’ll just say one last thing, and I find this amusing as well. Even though his philosophy and my philosophy are rather different: if we would both do validation for some instrument, we would do pretty much the same things and probably make the same inferences. We disagree on what validity is, but we agree on how you’d do validation.
It sounds like you also want to contribute to psychological theory. You’re interested in the different learning mechanisms.
Absolutely! What I’ve been reading and thinking about a lot more in order to do game-based and simulation-based assessment is situative, sociocognitive psychology. That’s more tuned to in-the-minute interactions and building individual resources. I think Denny might appreciate that as well. I also said to him that even though I don’t agree with everything that he thinks, I’m really glad that he’s doing the work that he is, because, if nothing else, even if he’s totally wrong and I’m totally wrong, it keeps other people thinking about it as well, and maybe they’ll be right.
A very simple question: what is your most cited paper?
That would be the one called “On the structure of educational assessment,” and that’s the one I did with Linda and Russell.11 I can say a little bit more about that paper, connecting with the future of psychometrics as well. I had a very important experience when we were working on Hydrive: we didn’t have tools off the shelf to do what we wanted to do. One of the people that I ran across in that work was David Schum, an evidence scholar. That was very helpful to me and he bumped my thinking up a level of abstraction about what it is that we’re actually doing. He took work from statistics, reasoning in the presence of uncertainty, but also from philosophy and engineering; all of the subject domains that do inference under uncertainty in their own areas. Psychometrics is just one of many, medicine another.
Schum noted that perhaps one of the most important types of reasoning comes from law, legal reasoning, and a lot of what I do is based on argument structure. So what I tell my students is that when you’re learning about psychometrics, you learn a lot in the beginning by reading some books, such as David Schum’s book, The Evidential Foundations of Probabilistic Reasoning.12 When you read Schum’s book, you see from a bigger picture what is actually happening in drawing inferences.
11 Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). Focus article: On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62.
12 Schum, D. A. (1994). The evidential foundations of probabilistic reasoning. Northwestern University Press.
That was very helpful because it helps us cast what we’re doing in
psychometrics as an instantiation of those principles. It makes it easier to understand how we apply those principles as data change, as psychology changes, and as the inferences we want to make evolve over time.
Who do you consider the most important researcher?
I thought a lot about this, and that’s really hard because there have been a lot of people who have made important contributions. Psychometrics isn’t like biology, where everyone would say Darwin.
Well, it would surprise you though that after 13 interviews, there’s a lot of overlap, but I’m very interested to hear who you think it is.
I would say Lee Cronbach. He made technical contributions, very important ones, but he laid down some real mileposts about how psychometrics isn’t just about measurement, but also about the quality and the nature of inferences that you’re making. For example, if you have a body of data, it’s not simply a question of the amount of error around the true measurement; it’s really about which inferences one wants to make. The same body of data might be very useful for one, garbage for another. There might be sources of uncertainty that are critical for one project, which are irrelevant in other projects. That’s a whole different way of thinking about what the data is. Cronbach also had a real engineering perspective in generalizability theory,13 this whole idea of G and D studies, which hooks in with my interest in applications. G studies, in his case, were studies designed to estimate the variance components that he would need to then go and design the artifact. The D study was where you’re actually going to get evidence. The D study could be much simpler in its qualities to actually obtain the data, but you need this G study, not to collect the evidence but to get the evidence about the quality of the D study evidence.
That’s what you need to build your artifact, and that’s a very engineering way of thinking: not a measurement way of thinking, more a design way of thinking. It serves measurement ends when that’s the inference you have to make, but there’s a lot more than that. A problem people have with psychometrics is that psychometrics is just about measuring people’s traits. Yes, you can do that if you believe there are such things. But psychometrics also contains tools for making sense along the lines that are substantive for the decisions you want to make or purposes that you have. It doesn’t have to be about measuring traits. It can be about the next intervention for a student, or whether you have enough information available to make a decision about whether people’s training is good enough at 9 months or whether they need a whole year of training. Questions like that are different from measurement questions.
13 Cronbach, L. J. (1972). Theory of generalizability for scores and profiles. The dependability of behavioral measurements, 161–188.
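The G study and D study distinction described in the interview can be illustrated with a minimal sketch. It assumes the simplest possible design, persons crossed with items and no missing cells, and uses the standard ANOVA expected-mean-square estimators for the variance components. The function names `g_study` and `d_study` and the toy data are hypothetical and chosen only for illustration; they are not taken from Cronbach's or Mislevy's own work.

```python
from statistics import mean


def g_study(scores):
    """Estimate variance components for a crossed persons x items design.

    scores[p][i] is the score of person p on item i (assumed complete).
    Uses the usual ANOVA expected-mean-square estimators.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[i] for row in scores) for i in range(n_i)]

    # Sums of squares for persons, items, and the residual
    # (person-by-item interaction confounded with error).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Negative estimates are truncated at zero, a common convention.
    return {
        "person": max((ms_p - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


def d_study(components, n_items):
    """Projected generalizability coefficient for a test with n_items items."""
    relative_error = components["residual"] / n_items
    return components["person"] / (components["person"] + relative_error)


# Hypothetical G-study data: 3 persons by 2 items.
toy_scores = [[2, 4], [4, 4], [6, 10]]
components = g_study(toy_scores)

# D-study question: how dependable would scores be with 2, 4, or 8 items?
projections = {n: d_study(components, n) for n in (2, 4, 8)}
```

The design point matches the one made in the interview: the G study is run once to learn about the sources of variation, and the D study projections then inform how the operational assessment should be built, for example how many items are needed before the evidence supports a given decision.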
I don’t know much about generalizability theory by Cronbach, but the only thing I’ve heard is that it has not been successful, or that only a few psychometricians have heard of it. Maybe that’s different in your field, I’m not sure.
I don’t know about that, but it has been successful in influencing my thinking! But, here’s another downside of psychometrics. There’s a happy side and a sad side of psychometrics. The happy side is when you can figure out ways to capture and analyze and think about data that are going to help solve problems, and that feels very good. The sad side, which probably happens more often, is when people, by their intuition, gather some data and want to make some inferences, but the data tell them less than they think they do. And then, as a psychometrician, it’s your sad job to give them the news. Every now and then, they work with you early enough…
…to prevent that.
Yes, but this still happens a lot. It happens so many times that you’re called in after the fact, and the data just don’t support what people want. Partly because the world is more complicated than they think it is and partly because you have to work harder in design to get information than anyone realizes.
Do you think that’s where Cronbach comes in?
Yes. Before there was psychometrics at all, people had been doing assessment and examination for at least a thousand years. But it was all done too simply: here’s a performance, some judgments, the decision, and we’re done. Quality of evidence wasn’t even part of that. Intuitively, people might work hard to get good evidence, but they didn’t have this framework of being able to characterize the quality of the evidence, let alone do designs for procedures to maximize quality of evidence, and G theory gives you machinery to do exactly that. Some of the applications of G theory had to deliver bad news, so to speak: sometimes you just don’t have the evidence you thought you did for what you wanted to do.
Cronbach’s last big applied project was the California Learning Assessment System.14 He was able to tell them that there was some good evidence for tracking process for the big schools. But there wasn’t enough information about individuals to make any kinds of consequential statements about the students, and politically speaking, that was one of the selling points for that program. That, along with other political kinds of problems, kind of killed the program.
14 Cronbach, L. J. (1995). A valedictory: Reflections on 60 years in educational testing. Washington, DC: National Academies Press.
If you want to get a medical degree in the USA, you have to pass, among many other tests, two simulation-based exams. One is called Standardized Patients, where you’re interacting with a series of actual people, portraying a case. The other is a computer-based assessment of management problems, where you’re working through diagnosis, prescription, follow-up, what is and isn’t working, and what to
do next. They worked very hard doing generalizability studies for that; it’s a very high-stakes decision. Cronbach was on their advisory committee, and I don’t know if Cronbach said this or not, but basically, knowing what’s in generalizability theory, the National Board of Medical Examiners had two options: either they could do the kinds of things that Cronbach developed in G theory, and understand the quality of the evidence for making this highly consequential decision, or they could ignore it and somebody else would sue them, and then use generalizability theory to win their case. Not a lot of people like G theory; its original presentation is sometimes idiosyncratic in its notation and it’s not very familiar; a lot of universities don’t even teach it. Perhaps they should! In a way, some of the ideas are coming back in new guise too. Mark Wilson and Paul De Boeck do what they call “explanatory item response modeling,”15 and you see ideas of generalizability theory there as well. Out at Berkeley, Sophia Rabe-Hesketh is doing general modeling that includes IRT, but it also includes hierarchical models and multistage models. So the ideas of Cronbach in generalizability are coming back, maybe not with that label on it, but the ideas of how you model, and the quality of your evidence taking into account these important evidentiary factors, are working their way back in.
So you would say psychometrics has much more to offer, besides only measurement.
Yes, and I would say (this is one of the things that I consider a real contribution of psychometrics to society) that ways of thinking about how to solve measurement problems have turned out to be quite useful and powerful far beyond those particular problems. It’s just that nobody knows it!
What are you working on now?
I just about finished a book that’s on the articulation between a sociocognitive perspective on learning in psychology and the measurement tradition.16 I’ve been on a number of committees with people who are proponents of one or the other, and sometimes they didn’t see themselves as having anything to do with the other, and a few times they even saw the others as the enemy. But I’ve had the fortune to have worked on enough different applied projects where you have to think about drawing ideas from both to start to see where connections are, so that’s what I’ve written about.
15 De Boeck, P., & Wilson, M. (Eds.). (2013). Explanatory item response models: A generalized linear and nonlinear approach. Springer Science & Business Media.
16 Mislevy, R. J. (2018). Sociocognitive foundations of educational measurement. Abingdon, UK: Routledge.
I’m almost done with the book, and I will probably retire in maybe 2 years. I want to write a couple more papers and work on a couple more projects with other
people who are doing things in projects on simulation-based assessment and assessment in diverse populations. Those papers are continuous with my former work, because they’re about applying these ways of thinking about building arguments, building models, taking into account what’s happening in learning, and in the milieu of situations, in the learning and activity structures that people build their capabilities around.
So as a final question, what do you think is the biggest challenge for psychometrics?
This is a variation of other things that we talked about: there are very rapid advances today in technology, in psychology, and in learning analytics. The biggest challenge for psychometrics is not getting left in the dust. All these folks are doing things that psychometrics has relevant ideas for, and they’re going to run into problems that are on the surface more complicated: there are fancier problems, and there are bigger data, but there are some of the same problems that Cronbach worried about 50 years ago. And these problems are just as important for our society; they’re just as important as making valid and fair decisions about individuals. Psychometrics has a lot to contribute and the challenge is connecting. We shouldn’t stay with comfortable, familiar problems, but help tackle the new problems that are emerging at a faster rate than ever.
Thank you very much for this interview!
A pleasure, a pleasure.
Chapter 9
Ivo Molenaar
“These people were doing something they called ‘factor analysis,’ which I had never heard about; I was totally ignorant!”

Ivo Molenaar was professor of statistical analysis and measurement theory for the social sciences at the University of Groningen, and he was president of the Psychometric Society in 1997. He wrote his dissertation in mathematical statistics at the University of Amsterdam under the supervision of Jan Hemelrijk and Willem van Zwet. Later in his career, he became interested in statistical methods for the social sciences, like Bayesian statistics and IRT. He passed away on February 26, 2018.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_9
Thank you, Ivo Molenaar, for participating in this oral history project on the history of psychometrics. You’re a past president of the Psychometric Society. In this interview I will be asking you questions on your career as a psychometrician, on the relation between psychology and psychometrics, and of course on your view on the history and the future of psychometrics. Let’s start at the beginning; how did you end up in psychometrics?

It was a long way. I started off in high school, of course. I won’t bother you with all the details about the Dutch school system, but I attended a variety of courses at school, mostly arts and sciences. So after graduating, I could choose any academic study I wanted. I finally decided to study mathematics at the University of Amsterdam, first with an emphasis on abstract mathematics, but in the second year, I got more and more interested in applications. I ended up with a major in mathematical statistics. The first professor in that subject, David van Dantzig, had died, and then his successor, Jan Hemelrijk, got me interested in applied statistics. When I say “applied statistics,” it’s not what you’d call applied statistics now; at the time it was very mathematical. The topics of measurement error and random samples rarely came up; applied statistics was a matter of proving theorems about convergence to infinity. It was very useful and very interesting. There was a research institute called “het Mathematisch Centrum,” later the CWI, Centre for Mathematics and Informatics, and they ran a statistics department of six or seven professional statisticians. Usually when you came there after your studies, like I did, you had to learn a lot and help other people with statistical problems; we did a lot of consulting. We taught some postgraduate courses to anybody who was interested in attending. And, of course, we also had time for developing ourselves, and the idea was that if you were good enough, you’d end up with a thesis.
I was there for 8 years, 1962 till 1970, and in the end, I indeed wrote a thesis.1 The first topic I had selected turned out to be a bad idea because someone in Romania had already proved, 30 years earlier, everything that I wanted to prove. I switched to approximations to discrete distributions. A lot of statistics is based on the normal distribution, but there are also processes in which you count things and then you typically get a binomial distribution, a hypergeometric distribution, or a Poisson distribution. Those distributions are discrete and formally related to the normal distribution. If you choose the right mean and the right variance, then maybe, hopefully, the distribution approaches a normal distribution. But when the data are discrete, the distribution never approaches a normal distribution in small samples. Many colleagues were interested in N to infinity, when everything is smooth and continuous, but my thesis was about what you should do when the distribution is not normal and the asymptotics do not yet apply. In those days, this research was important because the possibilities of calculating distribution functions, confidence intervals, etcetera were very limited; we only had primitive computers. Nowadays, the computers are so accurate that even with binomial distributions for N = 500 or so, you can still exactly calculate each term from the previous one. When you did so in the old days, you’d end up with nonsense. We studied whether we can use this binomial variable in a transformed way. Where the variable is skewed, we transformed it so that the distribution becomes symmetric, then we worked on that distribution, and in the end we’d switch back. And that was mainly the topic of my thesis.

Who were your supervisors?

One was Jan Hemelrijk, I already mentioned him. He was the professor of statistics at the University of Amsterdam. The other one was Willem van Zwet, who is only 2 years older than I am, but he was already professor of statistics at Leiden University; he was so brilliant and they wanted him badly. Like many others, he had studied in Berkeley; he had studied statistics with people like Neyman and Erich Lehmann. My thesis was a rather practical thesis for those days, but it was still far from the actual dirty hands problems of people who were actually collecting data. I published some of the research from the thesis, and I went to a couple of European statistics meetings. There I got to know Ganapati Patil, who had a chair at Penn State. I don’t know whether it was his initiative or mine, but he asked me if I wanted to come to the United States for a year. Well, I was very interested! Of course! It was an honor. It was not unusual in those days to go to an American university after finishing your Ph.D. and before taking on a real job. They had better computers and did more advanced research. You could learn a lot there. So my wife, our three young daughters, and I went to America; it was my first transatlantic flight! I went to the department of statistics at Pennsylvania State.

1 Molenaar, I. W. (1970). Approximations to the Poisson, binomial and hypergeometric distribution functions. MC Tract 31. Amsterdam: Mathematisch Centrum (now: Centrum voor Wiskunde en Informatica).
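As an editorial aside, the thesis topic described above (calculating each binomial term from the previous one, and falling back on an approximation when exact calculation was impractical) can be illustrated with a minimal sketch. The function names and the values n = 500 and p = 0.1 are illustrative assumptions, and the continuity-corrected normal approximation shown here is the standard textbook method, not the refined approximations studied in the thesis.

```python
import math


def binomial_pmf_terms(n, p):
    """Return [P(X=0), ..., P(X=n)] for X ~ Binomial(n, p).

    Each term is computed from the previous one with the ratio
    P(X=k+1) / P(X=k) = (n - k) / (k + 1) * p / (1 - p).
    Note: the starting term (1 - p) ** n can underflow to 0.0 for very
    large n, in which case this simple recurrence breaks down.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    terms = [(1.0 - p) ** n]
    odds = p / (1.0 - p)
    for k in range(n):
        terms.append(terms[-1] * (n - k) / (k + 1) * odds)
    return terms


def binomial_cdf_exact(k, n, p):
    """P(X <= k), summed from the recurrence terms."""
    return math.fsum(binomial_pmf_terms(n, p)[: k + 1])


def binomial_cdf_normal(k, n, p):
    """Normal approximation to P(X <= k) with continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    z = (k + 0.5 - mean) / sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    n, p = 500, 0.1  # illustrative values only
    for k in (35, 50, 65):
        exact = binomial_cdf_exact(k, n, p)
        approx = binomial_cdf_normal(k, n, p)
        print(f"k={k}: exact={exact:.6f} normal={approx:.6f} "
              f"diff={approx - exact:+.6f}")
```

Because the binomial distribution with small p is skewed, the symmetric normal curve tends to be least accurate in the tails, which is the kind of discrepancy that motivates transforming the variable before approximating.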
I hardly saw Ganapati Patil at all, because he was always either in Washington or in India, but we still collaborated and there were other colleagues who did interesting things. In December, the phone rang, and this was the dean of the faculty of social sciences at Groningen University. He said “Good morning Mr. Molenaar, we have a new joint chair, for psychology, sociology, and educational science, and we are looking for suitable candidates to teach at a more advanced level than the obligatory courses which all the students take. Would you be interested in becoming this professor?” Wow! Of course! Even for those days this was a spectacular move, so I called Hemelrijk and Van Zwet about it and they said “why don’t you do it?” The position wasn’t really mathematical, but it was interesting enough and the social sciences were developing rapidly. So I flew back to the Netherlands, went to Groningen for 4 days, talked to all the people (it was all very hectic), and at the end, the conclusion was that I could accept the job. I flew back to my family and I told them we were not going to Diemen (a village near Amsterdam where we had lived before), but that we were going to the north! I had already rented a house for August in a small village called Peize, near Groningen, so that the kids could start the new school year there.
This was quite the transition for me because, even though I had always been interested in the substantive problems of my clients, I had to learn about completely new disciplines. We have a handicap in the social sciences: psychology and sociology are not taught at secondary school, which is quite different from other disciplines such as biology and economics. These social science people were doing something they called “factor analysis,” which I had never even heard about; I was totally ignorant! One of your questions is, “do you consider yourself a psychologist?” Not at all! Clearly not. But I had to learn about it. While still finishing my year at Penn State, I read books about factor analysis, event history analysis, and multivariate statistics (which I knew very little about), so that when I came to Groningen I was not totally ignorant anymore. I had a colleague called Robert Mokken, who had a similar chair in Amsterdam, and he had just written a thesis on developing a nonparametric measurement model for measuring attitudes and abilities.2 So I read Mokken’s book in my garden in State College. He was an ex-colleague of mine, and we started collaborating on these topics. After a few years, by trial and error, I found out which topics were interesting for my clients. Some of the clients were really interested in very advanced things, but most clients were just simply collecting data, and then coming to me and saying “here are my data, what should I do?”

Were your clients psychologists?

Mainly psychologists, but also quite a few sociologists. We had a professor of sociology, Ivan Gadourek, who came from Czechoslovakia originally, and he had given sociology in Groningen a very quantitative flavor. At the end he was succeeded by Frans Stokman, also a famous name, a pupil of Mokken, and he did a lot of work on social network techniques, stuff for the sociologists. There were also people from education, but less so.
Many people were from what we call in Dutch “pedagogiek,” or child studies; they’re generally speaking not quantitatively oriented. But there was also a pedagogiek department in Groningen which was interested in quantitative techniques, and we had the RION Institute for Educational Research, who did research at local schools. I was aware of the fact that my chair in Groningen and Mokken’s chair in Amsterdam were mainly new chairs. There was also Adriaan de Groot, the first methodologist of the Netherlands you could say. I met with him and read his book3 on methodology, which was very interesting for me and helped me enormously in finding out the problems and pitfalls in the social sciences. So although I didn’t meet him regularly, I learned a lot from him and his book. Later when he was retired in Amsterdam, he had a summer house in Schiermonnikoog, a small island north of Groningen. One day a week, he would come to Groningen and we’d talk about his new book,4 which was about the scientific forum, and the idea that knowledge should grow by discussion between peers. I learned very much from Adriaan and others, like Wim Hofstee.

2 Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague, The Netherlands: Mouton/Berlin, Germany: De Gruyter.
3 De Groot, A. D. (1961). Methodologie (Methodology). Den Haag, The Netherlands: Mouton.

So did you learn from them what kind of problems the social scientists had?

Yes, and I learned from them that there is something called capitalization on chance, which means that if you test everything, you always get some significant results. And to ask critical questions like “how big is your sample size?” “Why did you include those people in your sample?” “Why didn’t you include these others?” There were a lot of people who helped me with these practical questions, and I got very much interested in those topics, even more than in actually proving another limit theorem. I had a reasonable relation with my colleague in mathematical statistics here in Groningen, Willem Schaafsma, but he was very much about the beautiful mathematics, though he also had clients, especially in medical school.

And at one point you started doing psychometrical research.

That had a lot to do with the book by Lord and Novick, Statistical Theories of Mental Test Scores,5 which came out very shortly before I moved to Groningen. And in Groningen I met a couple of people in the psychology department who were interested in studying that book. So we had regular meetings, once a week, when we’d study a chapter of that book. And again, discussing these things with other people is far more useful than sitting in your own chair and reading it all for yourself. Similarly, I was also part of a group which studied a book on mathematical psychology by Bezembinder6 in Nijmegen. That book was mainly about what you’d call, in modern terms, multidimensional scaling (also completely new to me).
Bezembinder and Eddie Roskam, a colleague of his in Nijmegen, were very active in this field, and they even developed their own software for it.

4 De Groot, A. D., & Visser, H. (2003). Het forumwaarmerk van wetenschap: argumenten voor een nieuwe traditie. Amsterdam: Koninklijke Nederlandse Akademie van Wetenschappen.
5 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
6 This book was possibly: Bezembinder, T. G. G. (1970). Van rangorde naar continuüm. Deventer, the Netherlands: Van Loghum Slaterus.

Was this psychometrics?

Well, in a sense yes. We had a small society for mathematical psychology and there is even a Journal of Mathematical Psychology. They publish highly abstract research which usually tries to do mathematical psychology from axioms, which in itself is fascinating. In practice it didn’t work, because you never had real data collected in such a way that these models would really tell you something about what you wanted to know. So I kind of left the mathematical psychology people, there were only a few of them anyhow in the Netherlands, and by that time I turned a little bit more to the international scene. I vaguely knew about the Psychometric Society and its journal, Psychometrika, which incidentally is as old as I am. Of course, I was not a member then. At the end of the 60s and early 70s, I got interested in the Psychometric Society, and, for the first time, they elected a non-North American as president of the Society. Here came Karl Jöreskog, who actually went to ETS for a year and developed his LISREL models there, which were abstract models for unmeasured variables. These models answered some of my queries about what we can do to incorporate measurement error in statistical techniques. Before Karl Jöreskog, there was Mel Novick. Novick was kind of my age, had written with Fred Lord that book on Statistical Theories of Mental Test Scores, and he had written another book with Paul Jackson on Bayesian statistics.7 Both were very interesting, and we invited him to teach a course about the Novick and Jackson book in Groningen. Because our people in chemistry and astronomy needed huge computers, we now had such a computer in Groningen. So we could easily implement a course with active participation from students, who could do their own analysis using the complicated software which was made available especially for that case. This was a period in which there were many younger staff members at all universities and also at CITO, the Central Institute for Test Development, the Dutch ETS. CITO was quite special, because there were no other European countries at the time with such an institute. There are a few of these testing agencies in America, because Americans use a lot of tests for admission and for qualification. Because CITO was a very lively place and they were developing interesting things, there were people from CITO who read about Novick’s book and asked if they could come and attend the course by Melvin. Looking back on my career, Mel was the one who inspired me most, both for the Bayesian part but especially also for the mental measurement part. He was also interested in what I had been doing.
Two years later, I went to Iowa City where Mel was teaching, and I helped him develop the so-called CADA package,8 which was a computer package for Bayesian analysis. That is I think when I started to become international, if you may say so. There I also met Charles Lewis, who was teaching at Champaign-Urbana, in the Midwest of the United States. Charles was looking for a place to spend his sabbatical, and since we got along quite well, we decided he’d come to Groningen for half a year. Those 6 months became 7 years, because Charles was disappointed about many things in the academic climate in the United States, which was very competitive. It seemed to him that the Netherlands was academic paradise; everybody collaborated with everybody, everybody was optimistic and interested, also on a national scale. In America, you have to wait several hours before the next train leaves; here we have a train to Utrecht, to CITO, to Amsterdam every 30 min, so he was very happy here.
7 Novick, M. R., & Jackson, P. H. (1974). Statistical Methods for Educational and Psychological Research. New York: McGraw-Hill Book Company.
8 Novick, M. R., Hamer, R. M., & Chen, J. J. (1979). The Computer-Assisted Data Analysis (CADA) Monitor. The American Statistician, 33, 219–220.
And what kind of research topics did you discuss with him?

We had something we called N-group regression, which is now called multilevel analysis. It was very interesting because it also involved measurement and therefore psychometrics. There’s Kelley’s formula about regression to the mean, a simple case, but you can do this in a more complicated way, and that’s what we did a lot of work on. One of my own students, Anne Boomsma, was interested in the robustness of LISREL.9 In LISREL people assume everything is normally distributed, and that the covariance matrix that you’re supposed to calculate is indeed reliable, though this is probably not the case in small samples. Charlie did all kinds of things: he had lively consultations, he worked with people from CITO and he traveled. I now come to the point, I think. Karl Jöreskog got elected as president, as the first non-American. For many North Americans, that was strange, having a person there who didn’t even know how you carry a motion and how it is supported, how the whole process of decision-making within the Psychometric Society goes along; it was all quite new. But a few years later, they decided that maybe I should consider becoming an editor of Psychometrika, because they were looking for one. At the time, there were schools which were hostile to one another, especially in the domain of LISREL, so they didn’t want an editor from one of those schools. People from either of those schools might immediately reject a paper from the other group because of the hostility. They wanted somebody neutral, whom they vaguely knew. Melvin Novick and Charles Lewis knew me quite well and so did a couple of others from when I visited Bock in Chicago. They decided that I should be the next editor of Psychometrika. Again, you’re very lucky when this happens to you during your career.

Sure, but it’s not just luck, right?

It’s not just luck, no.
It also has to do with how you present yourself, what you have published, and what you still want to publish. I also got very interested in this whole reviewing process. The whole idea that you have anonymous peer reviews of papers was quite heavily developed by psychologists, more than biologists or other scientists.

9 Boomsma, A. (1983). On the robustness of LISREL (maximum likelihood estimation) against small sample size and non-normality. Dissertation.

Is there a reason why the psychologists were so fond of that system?

I think psychologists are interested in how people make judgments about persons or issues. And because there is also a human factor involved in psychological research, unlike research about rocks or animals, it’s more important that you have procedures that are as objective as possible to judge, accept, and revise papers. I did it for 4 years, and my university insisted that I should ask for a lot of money from the Psychometric Society; those Americans you know, they always want to express everything in money! So I got a partial replacement of my own job, because it took me about 2 or 3 days a week to run the journal. And everything went by mail, in envelopes. Manuscripts arrived and you decided who would review them and you’d send another envelope to the reviewer and then after 3 weeks you sent another envelope to say “why didn’t you answer?” Some people were delayed and took a lot of time. And finally you had three or four reviews of the same paper, and you had to decide as an editor whether this was going to be accepted and what revisions were required. Being editor allowed me to stop doing certain jobs like management tasks, which I didn’t enjoy. It was considered an honor. I was the first Dutchman who was ever editor of this journal! After that, I just resumed my normal work. I had an awful lot of Ph.D. students. In the 29 years I taught in Groningen, I was the main advisor on 15 dissertations, which were usually about developing new models or new methods. At the same time, there were about 50 theses for which I was the secondary advisor, because the primary advisor would be someone who was an expert on an applied area and needed someone for the statistics. I liked to do this very much. It is interesting to meet people who have their own discipline and their own ideas, but don’t quite know how to model this. Is it allowed to use this test in this circumstance? How large is the sample that I need?

So as you mentioned, you’re not a psychologist. Did you begin to like the topic of psychology when you started working with psychologists?

Yes, it would be illegal to call myself a psychologist, because psychologist is a protected title, like medical doctor. The NIP, the Dutch Institute for Psychologists, only allows people to work as a psychologist when they have a degree from a psychology department.10 The main thing is that I didn’t know enough about psychology.
I learned a lot during these years, and especially I learned a lot about mental measurement and item response theory, which became my favorite topic in the 29 years that I worked in Groningen, but I certainly wouldn’t call myself a psychologist. If anything, I am a mathematical statistician who went away from mathematical statistics to social science statistics.

And would you identify yourself as a psychometrician, or is that also one step too far?

Yes, I would! I ran the journal and was elected president 10 years later, so yes, I am not shy of saying that I’m a psychometrician. When people ask me at birthday visits what I do, I’m always happy that CITO exists in the Netherlands. I can say that I help those people in CITO develop tests for your children. We have a long history in the Netherlands of people who distrust the idea that you can use numbers to measure people. There’s this famous paper by Thurstone, called “Attitudes can be measured.”11 His was one of the first attempts to measure attitudes. And of course, you don’t measure John or Mary, but you measure the ability of John to solve certain problems or the attitude of Mary toward racial differences. You don’t measure the person, but you measure a property of the person, and that is a very important distinction. That has fascinated me in all those years and that’s why I’d call myself a psychometrician. That’s what psychometricians do: they want to develop tools to help the substantive psychologists, but also the substantive sociologist or educational researcher, do better measurement and check whether their measurement has sufficient quality.

10 “Psychologist” is not a protected title in the Netherlands. However, to protect psychologists with an academic background, the Dutch Institute for Psychologists (NIP) has instituted the title “Psychologist NIP,” which requires a Masters in Psychology and 1 year practical experience.
11 Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529–554.

So psychometrics has a supportive role?

Yes, for many researchers it has a supportive role, but we also need people who take the subject further and who develop new tools. What Anne Boomsma did for the robustness of LISREL in small sample sizes is more than just assisting a psychologist.

So psychologists go to APS or APA conferences and psychometricians go to IMPS. I sometimes feel that the two are not really mingling. But you say that it’s actually necessary to have a group of people, psychometricians, who study those models, who think of new ones.

Yes, I’m convinced that’s indeed important. The development of this nonparametric model by Mokken, for instance,12 or the development of the Rasch model,13 could never have existed in a community where everyone is a substantive psychologist. You must talk to your own pals, and if you don’t have enough pals, you take a train so that you can meet with Eddie Roskam in Nijmegen, or with Don Mellenbergh in Amsterdam, or the people at CITO. It is important that psychometricians have the right to have their own meeting place. And at the same time, it is also important to keep informed about the new techniques. For instance, even a statistician should know something about brain research, if you want to help people with processing such data.

Psychometrics should be more than just those measurement models.
Yes, ideally we should also study stochastic processes, for instance; we should study events happening at a regular time, like event history analysis. We should also work with brain signals. If you look at them raw, you don’t see anything, you need to change these data, transform them, and then, when you eliminate noise, you can do something with them. I had a colleague in Groningen who was very good at this and we worked together quite a bit. His thesis was about heartbeat and how it is influenced by cognitive tasks which you perform, for which I was the second advisor. You need a substantive advisor for these things, which is definitely not me, and you need a statistician who is critical about the data quality and how you could improve it.

12 Mokken, R. J. (1971). A theory and procedure of scale analysis. Berlin, Germany: De Gruyter.
13 Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.
And what would you say is the biggest achievement of psychometrics?

I think the whole system of inspecting the quality of your data, and developing better tests, better measurement instruments, has been enormously influential, for selecting people, for job qualifications. For me that would be the main part of psychometrics, but I’m aware that the answer to this question always depends on what you have done yourself, and I would call myself an item response theory person, if anything. It’s interesting because it was something I had never heard of during my studies.

You’ve become a psychometrician throughout the years, not during your studies.

There was a bit of factor analysis, say in the 1940s, but it didn’t have all the tools and statistics, and all the possibilities of checking whether a test is culture fair, for instance. That came in the 70s. Culture fair testing is another point in Melvin Novick’s work that has inspired me very much. I was so happy with Melvin Novick’s 2 weeks here in Groningen that we also asked Karl Jöreskog to teach LISREL and Gerhard Fischer to teach about the Rasch model. We also invited mathematical statisticians, especially British ones, because British statisticians usually have their hands in the mud; they really work with raw data and are interested in practical solutions to problems. The courses were always a success. Occasionally, you’d invite someone who didn’t have a connection with the audience, but usually, the people who came from abroad were surprised that people from all the Dutch universities, or at least from five or six different universities, would take a week off and come to Groningen to discuss things. And the students did not only listen to what the expert said but also discussed among themselves during lunch and evenings. I think that’s a formula which has always appealed to me very much.

And has that continued over time?

It has continued, though not quite in this form.
What we did was “statistics in birdflight” which was a course for Ph.D. students, which provided just a little bit more statistics than the average social science student had had during his undergraduate career. There again, the mutual contact among the people became very important and someone would say, “didn’t you know there is a new publication by so and so?” and they helped each other immensely. And then IOPS14 got founded in 1987. Don Mellenbergh was one of the founders of IOPS; he was very good at bringing people together. We organized some courses and the meetings provided a nice opportunity for Ph.D. students to meet each other, since there were only so few of them.
14 IOPS is the Dutch and Belgian graduate school for psychometrics and sociometrics.
It was a small world.

It was a very small world. All students had their own specialty in psychometrics, so it was important that they could work together. At the defense of the thesis, some opponents were often from IOPS and they were present at the ceremony. I’ve always liked that very much.

The Netherlands is quite prominent in psychometric research, more so than other European countries. Do you have an idea how that happened?

I once tried to interest the Greeks in doing psychometrics. But when there is a job opening in Greece and you look for a suitable person, you are told that the son of the minister would be interested in applying, and there is hardly any room for an objective comparative test for picking the best job applicant. Americans are very fond of testing people and using the test to predict futures, jobs, and also to test whether the test is culture fair and doesn’t discriminate against certain groups. And a bit of that is also found in Scandinavia. To my great surprise, it is not found in England, as far as I know. There is the British Journal of Mathematical and Statistical Psychology, but the British are not very interested in testing. There were some very influential British statisticians who really disliked it. Harvey Goldstein from the London school was very much against all that I did.

What was his argument?

I never quite understood. But, for instance, he was very much against item response theory. In England, there was a tradition in the multivariate normal domain. Roderick McDonald is an example of someone who was doing work in the multivariate normal tradition, and he’s doing very good work, but psychometrics was not a great success in England. France had a tradition of using data matrices, for instance, correlation matrices, doing matrix algebra with them in a very clever way, and then coming to some representation, for example, the first two principal axes. That area was often practiced by Jan de Leeuw and his followers.
It’s a completely different ballgame, very different from what we did in Groningen or what Don Mellenbergh did in Amsterdam. I never really understood how you could develop valid knowledge from such manipulations of matrices. There was a section in the journal of the Dutch Statistical Society, which organized a debate between Jan de Leeuw and myself in 1988, where I was supposed to challenge his view about how you can “put it all in a matrix, you look at the pictures and you can see what it is.” Although later on relations were not quite friendly, there was no hostility between me and people like Jacqueline Meulman or Willem Heiser. However, traditionally, they did things in a different way. There were never really attempts to eliminate the other group. But it’s still true that there is a tradition in Leiden that doesn’t exist anywhere else. In Nijmegen, they also had this tradition, but there it more or less died out when Eddie Roskam passed away.
So going back to your own career, what do you think was your most influential work?

I think that would be my presidential address which I reread today to be well prepared: “Data, model, conclusion, doing it again.”15 It gives a nice view of what I think is important for the whole of quantitative social sciences. It’s something which has always intrigued me. When you blindly calculate a problem with standard error, what does that really tell you about the substantive problems? And I always tried to teach my own students, and also those who were developing new methods, that they should do consultation, that they should talk to the people who do substantive work, to the problem owners. There’s one Dutch pun that I’ve never been able to translate to English. When you have collected your data and done the analysis, you see your data on the computer screen, you can see what it is, you can reflect on it, you can “nadenken”16 or think about it. But now, imagine that you’re in a situation when you don’t yet have any data: what can we already decide on what will be the core questions? What will be the core issues? And I call that “voordenken,” not “nadenken,” so “prethink” not “post-think.” Imagine that the data are already there, what would you do, what are you really interested in? In many cases you’d discover that the original problem formulation is not at all adequate for what you really want to know. Just 2 weeks ago, there was a new professor in Groningen, Marieke Timmerman, and she mentioned the “voordenken” and the “nadenken.” I can’t translate it. Anyhow, that is something which I think is very important, and of course, trying to keep the link between the substantive researcher and the formal modeler.

What do you think is the most important work ever written in psychometrics?
It’s always very difficult to choose, but I think I would use Lord and Novick’s Statistical Theories of Mental Test Scores, because it is on the transition of the old classical correlation and classical test theory-based models, to the item response models and latent trait models. You can see in the book that it was written by two authors. Fred Lord was the classical one and Mel Novick brings in the logistic models, and for the psychometric community as a whole, that was definitely a very important step. And it has remained; other interesting developments have come and gone. Multidimensional scaling in its classical form is kind of dead; I rarely see it used. But these models for measuring mental abilities and mental attitudes are still heavily used, and I believe that they will continue to be used in the coming years.
15 Molenaar, I. W. (1998). Data, model, conclusion, doing it all again. Psychometrika, 63, 315–340.
16 In Dutch, the translation of “to think” is “denken” or “nadenken.” But the prefix “na” also means “after.” Ivo Molenaar wants to make a distinction between thinking about problems after the fact (so, literally speaking in Dutch this would be “na-denken”) and thinking about problems ahead. This would not be “nadenken,” but “voordenken” (the prefix “voor” means “ahead of time”). “Voordenken” is not an actual word in Dutch but is given by Ivo as a contrast with nadenken.
What do you think lies ahead of psychometrics? What is the next step?
This is very tempting, and very dangerous for me, because I retired in 2000, and I stopped going to meetings and reading journals in 2004 or 2005, so I don’t know what everyone is doing. But I occasionally do see things: they have more computational possibilities now and have what they call “big data,” which means that they collect enormous amounts of data. I’m getting old-fashioned, but I think maybe you shouldn’t collect so many data, because it’s only going to cause you problems.
Like?
Like overfitting, for instance, or interpreting a very small standard error in your formal model, forgetting that there is a far larger error because your model doesn’t quite fit, because your model is not adequate. Models are best possible approximations of reality, and there is always this idea in item response theory that everybody should just have one latent trait value, with a certain probability distribution of his answers. Well, that isn’t true: there are always strange dependencies and weaknesses and very erratic persons who do not follow a model. When you have a computer that does everything for you, it gets very difficult to be critical about what’s being calculated. Maybe this is just a story by an old man.
So psychometrics should open up to more techniques but you have some concerns about big data.
Yes, and there should always be a Socratic discussion about what we are really doing. And that is also a reason why I think having regular meetings, both national and international, is important because there you can talk about these things. The other point with regard to the future of psychometrics is that now, everybody can write just about anything and put it on his own webpage: the peer review and the respect for accepted publications is in danger. People seem to pretend that publications exist as soon as you have written them down.
And in that process, there has not been a reviewer and an editor who has pointed out the weaknesses, said “you should prepare this and otherwise we don’t publish it.” When I was editor of Psychometrika, I learned that this whole review process is very important.
You are very experienced with statistical mathematics, and it’s a field that psychometrics is affiliated with. Do you think that psychometrics can also learn from other scientific disciplines?
Yes, I think so. The econometricians are using complicated multivariate dynamic models, and they are much better at using these advanced techniques. More generally, now that so many more data are easily collected, we have more possibilities of three-way data, when time and persons and variables are three things you want to combine. That’s something psychometrics might benefit from. But as I say, I don’t follow it closely anymore, so I’m not the best person to help people shape the future.
Well maybe, it is interesting for new psychometricians to hear these things, since you’re so experienced in the field. I think we’ve come to the end of this interview. Thanks Ivo!
My pleasure!
Chapter 10
Susan Embretson
“Testing is important in so many areas that I think that measuring the right constructs is getting more important than ever.”
Susan Embretson is professor of quantitative psychology at the Georgia Institute of Technology and was president of the Psychometric Society in 1998. Embretson wrote her dissertation at the University of Minnesota in 1973 under Rene Dawis’ supervision. Her research interests include item response models, validity, and the cognitive basis of item responses.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_10
Thank you for being here, Susan, thank you for your participation in this oral history project on the history of psychometrics. In this interview I’ll be asking different questions on three different themes. One is your career as a psychometrician, one is the relation between psychology and psychometrics, and lastly, your view on the history and future of psychometrics. And I always start with the question of how you ended up in psychometrics.
Right, there is a story behind that. When I was in high school, I asked the high school counselor if I could go into psychology without being a doctor or a psychiatrist first. She said yes, I certainly could. So I considered psychology as a possible major. But I had also been influenced by my high school Spanish teacher, who was an excellent teacher; she brought the whole Spanish club to Mexico on a trip one summer. So, I considered that major also. As an undergraduate at the University of Minnesota, I took my first psychology course. Of course, when you take such a course, you have to be the subject in some psychological experiment; thus, I was exposed to research on word associations and memory tasks. Additionally, there was even a rat maze experiment that I participated in as part of the course. And I thought, you know what, this is not for me.
What about psychology was not for you?
If psychology covered the topics emphasized in the introductory course, I didn’t like the field. And so I decided to go into Spanish. I took all the courses and spoke it very well. But then I didn’t want to take the various literature courses or the education courses required for teaching Spanish. So I quit school. I was a college dropout and I went to work for a year. But I decided that having a job without a degree did not lead to where I wanted to be either.
Thus, I enrolled in two night-school courses in the same quarter: individual differences and statistics. That is when I found what I wanted. The person who taught those courses also taught regularly in the psychology department, but he happened to be teaching both those night-school courses to earn some extra money. He eventually became my advisor. As an undergrad, I decided to major in psychology, but I will say, I must have been kind of an odd student anyway. The first book I ever bought outside of course requirements was Psychometric Methods by Guilford!1 So, finally, I had found what I wanted. And after that, I had no doubts; this was what I wanted to do.
1 Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York, NY, US: McGraw-Hill.
So you considered a fair range of possibilities, but you found your destination in psychometrics.
I would say that is right. Now, there was one other aspect that I think has been important in my career from those early experiences, and that was my experiences with testing. The Preliminary Scholastic Aptitude Test was administered when I was in high school. At the time, the test involved items with exceptionally obscure
Victorian vocabulary. Now this was not my interest, and so I didn’t score very well on that test, but I came away with the idea: why is this measuring my ability to go to college? I didn’t qualify for a Merit scholarship though I needed one. Fortunately, Al Johnson, who was an engineer who had built the tallest buildings in Minneapolis, decided he could support ten students. I ended up being one of those students. I thought it was interesting to be supported by an engineer; the only college background in my family was in engineering. So I was able to attend undergraduate college at the University of Minnesota. Subsequently, I applied to graduate school. A prerequisite for selection was scores on the Miller Analogies. During testing, I thought that the basis of the test analogies was really strange. Why were these items measuring intelligence? For example, one of them was: Moscow is to Vodka, as Copenhagen is to…? Now, I didn’t happen to know what the Danes liked to drink. And I thought, is this measuring intelligence? Another problem in the Miller Analogies Test concerned not fully explaining the kinds of relationships that might be considered valid. For example, there was an analogy towards the end of the test, which concerned words spelled backward. Now, I noticed that, but I didn’t think it was a valid relationship. So, that was yet another problem that I noticed in ability testing. I decided that I wanted to study testing in a different way than had been done before, that is, examining the cognitive basis of examinees’ responses. This led to my dissertation topic at the University of Minnesota on analogical reasoning that studied examinees’ perceptions of types of relationships in analogies. Several subsequent studies concerned several aspects of analogical reasoning. For example, what happens to processing and the external correlates of the test when you change the mixture of relationships?
So those early experiences had an impact on me, and frankly, I haven’t given up those questions at all. I still research the same thing: I study examinees’ cognitive processes in item solving and the impact of processing differences on other aspects of validity.
And you were supervised by Rene Dawis, right? What was he like as an advisor?
He was both a nice and very smart fellow, but he was overburdened with duties. He probably had 25 advisees at that time, so I didn’t have as much interaction with him as I might have wanted. During the last 2 years of my studies, I worked on his project, and it so happened that he was funded by the military to study analogies. The military test, the Armed Services Vocational Aptitude Battery (ASVAB), which is used for recruit selection and classification, has several subtests. I believe that at the time they were considering implementing analogies on the test. I will say, at the time it was not unusual for academics to have military funding for testing. Starting with graduate school, I have had many projects on military testing.
So in your time as a Ph.D. student you studied analogical reasoning.
Yes, that was part of it. Let me put a context on it. When I was in graduate school, classical test theory was still the main theory in measurement. However, IRT was just coming in, and that was an exciting time. My graduate student colleagues would be finding articles on IRT, and we’d be reading them together. In my advisor’s project, we in fact implemented the Rasch model. Thus, we had some of the early programs for estimating parameters. We were in touch with Ben Wright in Chicago and received his early programs to do the analyses. Shortly after that, David Weiss, another professor at the University of Minnesota, had a military project on computer adaptive testing. Of course, IRT was found to be the primary means of implementing adaptive testing. So IRT was definitely a big part of my dissertation, during the time that it was in its initial phase in the USA.
How did you continue your career? What happened after your Ph.D.?
I was looking for an academic position, and I had a couple of offers. Interestingly enough, one of them was from the University of Georgia (I am in Georgia now), and I turned that one down. Sometimes in quantitative psychology, the assistant professors get totally burdened with teaching. Georgia was going to be very much like that, and I thought, this is going to overwhelm me and I will not get a chance to do any research. The University of Kansas on the other hand was very open to whatever research a person wanted to do as long as it was done well (i.e., the research would get published). Further, I would have more time for research there with the less demanding teaching load. But also, because they did not have a quantitative program, I joined the experimental program. Thus, I had more direct connections with cognitive psychology, which greatly increased my expertise in the cognitive area. Eventually we did form a quantitative program, which was a great development.
However, I think that my early experience in cognitive psychology and contact with cognitive researchers contributed substantially to my research.
Can you identify the most important themes in your research?
The first theme is understanding examinees’ processing of items, from a cognitive perspective. So that means examining how aspects of item design will impact the examinee’s responses. That, of course, impacts validity, which is a major theme in my research. The second theme is developing psychometric models which would be appropriate for that type of study. So, in other words: I wanted to estimate the impact of these different kinds of processes in a particular item, and that required the development of new psychometric models, which is something I am still pursuing. I have a couple of new questions about those models which I hope to examine soon. And there is a third theme: validity. This theme can be traced back to graduate school. In fact, in my preliminary Ph.D. examination the committee asked me a question about it, namely, “what is construct validity?” My answer was that I just did not understand it. They took a quick glance at me, wondering if I was going to
fail, and then I said, “I do not understand how it is you can develop a test when you do not know what it measures until after you develop it.” Fortunately, they accepted my answer. However, I have retained that perspective throughout my career. My first paper on construct validity was in 1983,2 which included the cognitive processing perspective. What has happened since then is that the response processes aspect of validity is now one of the five aspects of validity, at least in theory. I think that is great. I have two more recent papers which include developments in validity as a major component,3 and I need more time for a broader viewpoint. In those papers, I have only considered cognitive traits, ability and achievement, but I think there are certainly some implications for personality, on which I have done just a little research. So validity has been a theme that underlies my research.
2 Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
3 Embretson, S. E. (2016). Understanding examinees’ responses to items: Implications for measurement. Educational Measurement: Issues and Practice, 35, 6–22. Embretson, S. E. (2007). Construct validity: A universal validity system or just another test evaluation procedure? Educational Researcher, 36, 449–455.
I also read that you were working on automatic item generation.
That actually interfaces with understanding the cognitive processes. You can generate items and develop item structures without understanding processes. When I started working with the military, there was some interest in abstract reasoning as measured by matrix problems. So the first item generator I had was developed in the 1990s to generate matrix problems. I also had a cognitive model of matrix problems to understand how features of items impact processing. An item generator requires two things. You have to identify a structure for an item, and then you need a database and a sampling scheme. If you empirically try out new item variants from a structure, you find out how similar those items are in difficulty. The answer is, from my research, very similar. The correlation between variants is in the high 0.90s. But you can also do something else. Let’s say you’ve got a new structure, which is a combination of features in an item. In the case of matrix problems, the features were the number and the types of the relationships, and also perceptual features in the cognitively based prediction model. Item difficulty as predicted by the cognitive model correlates about 0.80 with observed item difficulty. So, it does work! So if an item generator can produce items with highly predictable difficulties, full-scale try-out of these items is not necessary. You could have a very reduced empirical try-out, if any.
Have there been applications of item generation?
Well, testing is the hardest thing to change. I have two item pools out there, one is used to select people in a certain profession, the other one, an abstract reasoning test, is actually a more generally available pool which is used to select people in a variety of contexts. I do have other item generators; for example, one that generates spatial
ability items, which are in the ASVAB military test. A couple of years ago, I delivered five thousand generated spatial items for the ASVAB.
Have they tried them out?
No, they haven’t done it yet! So, I am still waiting to see how well those items work. The most recent item pool I have generated consists of math problem-solving items. That was completed just a couple of years ago. The items passed a full-scale tryout, so the items are fine in that sense. But the problem with those items is that they’re achievement items. Achievement testing is one of those areas, especially in the schools, that fluctuate all the time. My five hundred items were tied to the achievement standards of a particular state in the USA. Thus, right now I am in the midst of a new proposal to link the items to a system of mathematical skills called dynamic learning maps, and that is more general. That is, the maps can apply to all sorts of different blueprints of particular skills that specific states might have for a test. The proposal is being evaluated. I am hoping that that will link these generated items to the more general map framework, which can be used not only for formative assessment to determine where a student might lack skills but can also help in summative assessment, to select skills and new items and so on. So I am working on it, but testing is really slow to change.
They prefer certain formats.
Well, no, it is just inertia. In a way, since it takes so long to change, it keeps me around a lot longer!
Is that one of your goals, to change testing?
Absolutely, absolutely. I really think that the typical way of developing tests, where you give item writers some vague instructions to make certain kinds of items, such as being told “here is a test, make new items kind of like these,” is not very effective. They write the items and then they are submitted for empirical tryout.
Whatever skills/processes/abilities predominate in the collection of items define the central trait. Thus, we are not very close to designing a test for validity with that kind of a method. I far prefer that one understands a bit about how certain features of items impact their difficulty and their discrimination and then determine what trait to emphasize. That, in turn, impacts other aspects of validity, such as external relationships.
I recognize a lot of psychology in your work, is that a valid assumption?
Absolutely, absolutely.
More than in the work of other psychometricians?
I would say so. There’s a new breed of psychometricians who seem to have less substantive background, and I do not think that is a good thing. I think they might be dealing with rather narrow statistical issues that are not really going to make a difference in the discipline or what is being applied in measurement. So I really see a necessity to keep quantitative methods attached to a discipline so it can influence that discipline.
So you think psychometrics should always be closely connected to psychology?
Yes I do.
When you’re here at IMPS, do you encounter that narrowing down of psychometrics?
Well sometimes I do, and sometimes I do not. There is one area that is heavily into the design and understanding of items, namely, diagnostic modeling. That area is now fairly large. The only difference from my area is that their goal is to diagnose a class. Diagnostic models are basically item response models, but what is measured on the person side is latent classes that impact the patterns of competencies. But you cannot assess the classes without first defining attributes in items. Thus, the diagnostic modelers are analyzing and scoring items for the skills and attributes that are involved, which leads to examinees being diagnosed with respect to those skills. Hence, that research area is much closer to psychology.
So there are psychometricians who stay close to psychology, but you also have a movement that is becoming more distanced from psychology.
Yes, I’d say that is right.
So when you introduce yourself to a group of people that you do not know, and they ask what you do, would you say you’re a psychologist or a psychometrician?
Well, if it is a person who doesn’t understand anything about the field, I tell them that I am a psychologist, but I teach quantitative methods.
But you identify yourself more as a psychometrician than a psychologist.
Yes, I do.
When you look at psychometrics from a broader perspective, what does it contribute to society?
Testing is just entrenched: it starts out in the grade schools when kids are assessed for learning. Those tests start much earlier than ever now, when they are 3 or 5 years old. Grade school children are assessed in achievement areas, such as mathematics, reading, science, and so on. What goes along with that of course is that it has become more controversial.
The individual intelligence tests are often administered when a particular child is not doing well, and they are probably administered at the gifted end also, for selection into special classes. And testing is present in education all the way up the line. You get selected into colleges based on your scores, tests such as the SAT and ACT are in the mix, and of course graduate school! Cognitive tests are also often used in industry, to select people for certain positions, as they are in the military.
And there is also a big movement in personality and attitude testing which is certainly used in selection and placement. Testing is important in so many areas that I think that measuring the right constructs is getting more important than ever.
Psychometrics is not dying out.
Not at all!
It has a bright future in that sense.
Yes, it is even increasing.
So when you look at your own career, what do you think is your most influential work?
My most cited work might be a textbook. I do not regard that as my most important work. But let me tell you about that textbook though, because how it came about is kind of interesting. In the late 1980s, I was on the American Psychological Association’s committee that concerns testing issues, including complaints, new prospects, and so on. We were concerned that although item response theory was out there, it was not being used. And I think it was in the 90s, when we decided to organize a symposium at APA titled: “What every psychologist should know about measurement, and doesn’t” and one of the things they didn’t know about was item response theory. So my paper was on the importance of IRT and that psychologists were not applying it. I was trying to convince someone to write a textbook on item response theory, but I didn’t see anybody very enthusiastic about that. So when I came away from that symposium, I walked out and thought to myself: “you know what, nobody is going to do it. But maybe I could do it, if only I had a certain co-author.” Immediately, Steve Reise came to mind. Just a few hours later at a reception, there he was. And I approached Steve Reise and I said: “Steve, let’s write a book on IRT together.” “Great,” he said! That is a done deal. And so that is what we did. We entitled the book “Item response theory for psychologists.”4 We will delete “psychologists” from the title in the revision because the book has become more generally used than that.
But what we tried to do is explain IRT in a context that would be more familiar to psychologists. We have a lot of data in the book that tries to appeal to more common sense inferences. There were other books in education, which were good; Ron Hambleton and Swaminathan5 had a book, for example. But I was not really able to get much interest in reading that book from psychology graduate students. It was pretty formal, in terms of all the developments and derivations and that was not interesting to the psychology students. And the examples were only in an educational context, another problem for the psychology students. So we tried to minimize the derivations and use a more common sense type of framework. But also of course, we included extensive applications to issues that would be more familiar to psychologists. We published the book in 2000, and it’s still getting citations. We have been having trouble getting our revision together (a lot has happened in IRT), but eventually we will finish it!
4 Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. NJ: Lawrence Erlbaum Associates.
5 Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Kluwer Nijhoff.
I was wondering: you’d like psychologists to use that book, implement IRT models. Is that the task of psychometrics, to help or educate the psychologist?
Oh no! People from other fields buy the book too. Anyone who is in a context that requires measurement is a potential buyer of the book. That is why we are getting rid of the psychologist referent in the title because we have always had a broader audience.
So what fields do they come from?
The medical fields, sociology, education of course. So, a fairly broad audience. Educators might use both our book and a book like Hambleton and Swaminathan as well.
So it’s your most cited work, but it’s also a more general work in a sense.
Yes it is general! What do I think is my most important work? Okay, that is different. In 1999, I gave the presidential address to the Psychometric Society and that was work hot off the press.6 I had just finalized the development of an item generator and I had all of the cognitive research underlying that. I proposed a new psychometric model and discussed some implications for validity. I kind of ran the whole scope of what I had been trying to do; so that is an important paper. I have a second paper, which is in press right now,7 and it basically has a similar scope but it looks at some item types in some detail, trying to understand them from a cognitive perspective.
It also tries to link that cognitive understanding to test design, and that link is accomplished through considering validity from a different perspective. I also write a little bit on my most recent psychometric model which is one for diagnosis. That paper is forthcoming; hopefully it will have some impact!
6 Embretson, S. E. (1999). Generating items during testing: Psychometric issues and models. Psychometrika, 64, 407–433.
7 Embretson, S. E. (2016). Understanding examinees’ responses to items: Implications for measurement. Educational Measurement: Issues and Practice, 35, 6–22.
You think it’s important.
Yes. I think what I have to do is to write a couple of other papers though. I think that the construct validity issue and its relationship to design is something that really needs a broader audience. What I have to do though is to get more into personality
and attitude testing as examples, and that is going to be a bit of work. But I know there’s some work like that going on, so there’s a basis. Actually, it’s been well known that there are problems with those measures, like response styles in personality, and that has been known for a long time. The issue is, how do you design those measures to minimize those problems on the part of examinees? So, there’s research to collect down there.
On the one hand, it would be nice if things really change in the field of testing of course, but it’s also exciting that there are still so many things to be done.
Right, well, as I say, because testing has been so inert, it keeps me busy for a long time!
If you look at the history of psychometrics, what work, article, or book has really inspired you?
Well, when I was in graduate school, I went to early AERA and NCME conferences, because that seemed to be interesting, and John B. Carroll gave a talk. John B. Carroll really was making the case for trying to link items to what people are doing in the processing sense, so that was very influential. I consider John B. Carroll my male academic role model, because he made both psychometric and substantive contributions. His book where he re-analyzed all those measures of aptitudes, trying to look at the structure of aptitude, is pretty influential as well.8 So I would say he had a big impact, and I did get to know him during later years, which was really quite nice.
So, you also have a female role model?
Anne Anastasi. She wrote a significant book on individual differences and also on testing.9 I did meet her and got to know her a little bit as well. I was asked to give a talk at a symposium, now a yearly symposium, shortly after her death. So I went back to this book she wrote on individual differences, I opened the pages, and it showed all of my notes all over the book! I was so impressed by her concepts, her way of interpreting data and its meaning. It was really quite good.
Part of what I was so impressed with had to do with behavioral genetics and how she would interpret whether you can attribute behavior to genes or environment and how that was a question you couldn’t even ask. The way she went about addressing it was really well done, I thought. Of course in her book on testing, she discusses validity and I liked her concepts quite a bit as well.
8 Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press.
9 Anastasi, A. (1937). Differential psychology. New York: Macmillan. Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River: Prentice Hall.
So there are more and more women now at IMPS. They’re around. How was that back in the day? Did being one of the only women ever bother you?
I have nothing but good relationships with my colleagues and that goes way back. There was a meeting to be held in Sweden, in Uppsala, when Karl Jöreskog was the president, and I really hadn’t dared to go. Because they were worried that not enough young people would go, there were ten NSF fellowships to support young researchers. My colleague, David Thissen, told me about that, so I applied and I got it. Of course I was really intimidated by all these greats in psychometrics, and here I am, my first meeting. I had handouts of my presentation, but when I got up and started talking, I realized, I hadn’t handed them out! Darrell Bock came up to me, took some and handed them out, very supportive. So I would say, I got a similar reaction everywhere I went: they saw that I wanted to study certain things, that I was willing to go the distance and learn new things, and they never felt threatened by me. They saw that I was serious, so I would say, for me gender didn’t really make a difference. That brings me to a question you haven’t asked yet, which was about my critics! Let me give you an example there. My first article in Psychometrika10 was on multiple component latent trait models that I had developed. When it was under review, one reviewer asked why I didn’t cite the linear logistic latent trait model developed by Gerhard Fischer.11 And I didn’t know about that. It was not published in Psychometrika, but in Acta Psychologica, which is an international journal I had never read. To me that was an opportunity to learn. I actually had another paper later on multiple component latent trait models where I basically incorporated that model into each component. If a woman shows a positive attitude to wanting to learn, I don’t think you run into as much resistance in the field.
Do you think that the field in general is supportive? Not only of women in general but also of each other?

Yes, but maybe not as much as another group I belong to, which is the Society for Multivariate Experimental Psychology (SMEP). It’s a much smaller group, and everybody has known each other for a long time. It’s a good place to try out ideas and get all kinds of perspectives.
10 Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika, 49(2), 175–186.
11 Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
So how is that group different from the Psychometric Society?

It goes back to Raymond B. Cattell, who founded the group, so his graduate students were among the first members. I would say SMEP tends to emphasize personality more than cognitive psychology, and not so much IRT, but it does have a lot of structural equation modelers in it. It’s a little bit of a different mix than the Psychometric Society, and the goal is more substantive than the Psychometric Society’s, which is more about the methods. I don’t think the Psychometric Society is a place where I’d put out new ideas I haven’t developed very well yet!

We already talked about the biggest contribution of psychometrics to society, which would be all sorts of testing, but what do you think is psychometrics’ biggest scientific contribution?

Well, the development of the models to do testing. Also, a big emphasis now is structural equation modeling; it has been that way for a long time, and that has had a big impact on the way you make inferences from data. You set up models and look at how well the models work, versus just dredging around the data and finding a model. A traditional factor analysis would be dredging around the data and seeing what emerges, but in structural equation modeling, it is possible to take a confirmatory approach, where you determine what the factors should be according to some perspective and you look at how well the data fit. I think that has been a big change, and it has become state of the art to work that way.

Would you say that is a nice example of how psychometrics and more substantive areas can match up?

Right! Exactly. I think that is a good area. Joe Rodgers’ article,¹² from 2010 in the American Psychologist, was on this change in inference. It was quite a nice summary of how making inferences has changed. I will also say that the measurement side of the Psychometric Society has made a big change in how testing is conducted.
Testing used to be a lot of work: you had to make parallel forms in order to compare examinees, and the process of developing tests required a try-out of items and forms, and required test equating. Now we just have item banks and we just calibrate new items, put them in the bank, take them out, and design them as we want. We don’t have to try all of those items out anymore. So that really has made a big change in how testing is conducted.

12 Rodgers, J. L. (2010). The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65, 1–12.

Looking into the future, what do you think is psychometrics’ biggest challenge? What is still a problem to be solved?

A problem to be solved… That is hard to say! I think that the connection with the disciplines needs to be maintained and strengthened; I see challenges there in different locations. For a while, there was an expansion of quantitative psychology programs, and more and more were developed, which was a good development, but I think what has happened is that some of those programs have gotten too far away from the discipline, and their contributions are going to be limited. Even in my own department right now, we had to fight to get a replacement of a real quantitative psychologist: they were thinking that maybe someone in another area of psychology, who was able to do analysis, could teach a quantitative course, and I really think that is a bad idea.

Do you think it’s still important that psychometricians or quantitative psychologists are trained in their own disciplines? That it shouldn’t all blend into psychology in that way?

Absolutely. One kind of data that is important right now is intensive repeated measures, or big data if you will. In psychology we like to take parametric approaches to things; we don’t want to do machine learning, we like to test our own theories. Well, that requires a good deal of expertise in psychometrics to be able to deal with that data, to test assumptions, to develop the appropriate models. I have some data like that right now: I have sampled two individuals and one of them has 8000 lines of data and the other one has 21,000 lines of data, so how do you manage such data and make good inferences? That is a challenge and I think we need more people who can actually do those things.

So should psychometrics in that sense learn more from areas like machine learning?

No: people in quantitative areas should open their eyes to the needs in the substantive areas. Mindlessly looking for relationships is very profitable, but there are people who are dealing with intensive repeated measures in a more parametric way, which is more relevant. We had a retreat in my department where I presented this view, and I got real enthusiasm from a lot of faculty members about this, because they have similar data they’re trying to deal with.
They have data from fMRI, from electronic diaries, and they don’t know how to deal with it. They want to make inferences so they really need help. Now, that is the current pressure. And in the future, there’ll be other areas where more sophisticated methods are needed as well. I don’t know what those will be exactly but they’ll certainly be there. And it will require someone with real expertise, because when you’re dealing with big data, my goodness, your quantitative expert needs a lot of skills! Hence, making that interface is going to be a challenge and you’re going to need a person who is really devoted to that.

Hearing you talk about your work, you must have some important plans for the future!

Absolutely! I have a backlog of papers that need to be written for one thing. I also did a 5-year project on developing an adaptive diagnostic test that maps deficits in middle school; I intend to write some papers on that topic. But I also have a new direction, two of them actually; we’ll see if one of them ever works!
One of them is to link my generated math items to the dynamic learning maps. Dynamic learning maps show all kinds of skills that are needed to solve a particular problem. So you can keep going back, all the way to grade 1, to see what kind of math skills are involved. This dynamic learning map in math is currently on a four-foot by eight-foot Excel sheet in six-point font. What I want to do is to interface an item bank I have (they’re generated items, so I know what’s in them in terms of skill) and put them on the map. Then we can look for a way to go back in the map, to build variants that will take out one kind of component and then another kind of component, to eventually come to a better diagnosis. This will involve some psychometric work as well, because you can’t do endless testing with kids; hence, I’m looking into new approaches for using distractors. Again, I know what’s in the distractors, because when we do generation, we have to generate distractors as well. I also have some interest in big data. I mentioned I had two people with 8000 and 21,000 lines of data, and I’m exploring that direction now; we’ll see what comes of that. And then of course I mentioned that I’m not finished with construct validity: I want to prepare a paper so as to better integrate construct validity with design possibilities and items.

Still plenty to do!

Plenty to do.
Chapter 11
Wim van der Linden
“If you really want to be practical, you have to go deep theoretically to know what you’re doing.”

Wim J. van der Linden is professor emeritus of measurement and data analysis at the University of Twente and was president of the Psychometric Society in 1999. He wrote his dissertation under Don Mellenbergh’s supervision at the University of Amsterdam. Van der Linden’s research interests lie in applications of item response theory like computer adaptive testing, optimal test design, response time modeling, forensic analysis, and parameter linking. In each of these applications, he takes a Bayesian perspective.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_11
Welcome, Wim van der Linden, thank you for your participation in this oral history project, on the history of psychometrics. In this interview, I’ll be asking you questions about three main themes: your career as a psychometrician, the relation between psychology and psychometrics, and your view on the history and future of psychometrics. I always start with the question, how did you end up in psychometrics?

I think like most of us, I took a detour. I’m not aware of anyone who said, “I want to be a psychometrician” and then became one. No, it’s not a childhood dream. As a youngster, you wouldn’t even know there was an interesting field out there called psychometrics. I started in chemistry; I always wanted to be a chemist. This happened in the early 60s just when the double helix of DNA was discovered by Watson and Crick. The discovery was all over the press and sounded like the subject of the future. But also: chemistry had very practical applications, like the development of medical drugs, so it looked very interesting to me. And then, as soon as I entered the university, my interest started changing: I hated the lab work. It turned out that chemistry was more of a theoretical interest to me after all. In my second year, I decided to switch subjects to psychology. This was another field that was emerging, especially in the Netherlands, where a decade earlier psychology had changed from the well-known continental philosophical tradition to a more empirical discipline, with a strong emphasis on methodology and philosophy of science. And that really clicked.

What about it clicked?

Something in my background always made me look for rigor. I certainly liked the more substantive psychology classes, which did offer interesting ideas. But a lot of them were mainly verbal and speculative; there was hardly any rigor. So, I wandered around. It was impossible at the time to specialize in research methods, statistics, or psychometrics.
In fact, there weren’t any psychometrics classes as such at all. I remember one second-year class on psychological testing, which basically taught us classifications of all kinds of psychological tests with demonstrations of testing materials. So instead, I started specializing in social psychology but, again, kept looking for a hard core. You could take classes on research methods as an elective though, and the last class I took as part of my degree in social psychology was my very first class in test theory. We used an old textbook written by Magnusson.¹ It’s no longer available, and a lot of us probably have never heard of it. But it was very unusual in that it treated test theory in an axiomatic way. So, it started with the classical model in the form of a couple of basic assumptions, and then everything else was developed as theorems with a proof. Don Mellenbergh taught this course. He was brand new at Utrecht University, appointed to set up a new psychometrics department. When I finished my master’s degree, I was so happy to be hired by Don for this new department and have the opportunity to specialize in psychometrics, especially test theory. It soon became clear to me that this also was a very practical area. I got in touch with colleague psychometricians at CITO, the main testing organization in the Netherlands, and also began attending national and international gatherings.

1 Magnusson, D. (1967). Test theory. Reading, MA: Addison-Wesley.

Was it only Utrecht that was less involved in psychometrics, or was it nationwide?

Amsterdam already had a longer tradition of psychometrics, especially test theory and educational measurement. Don came to Utrecht from Amsterdam and then moved back to it later on. I moved to the University of Twente, but I did my Ph.D. with Don in Amsterdam. Nijmegen and Leiden already had strong programs as well.

So, what was your Ph.D. about?

It was about test-based decision-making, especially its applications to topics like pass/fail decisions, treatment assignment, placement, and classification. Decision theory, and more specifically, optimization always has been part of my interest. When I went to Twente, I was very fortunate to get deeply involved in the start of a new program on educational technology. As part of it, I was able to establish a new department on research methods, psychometrics, and statistics for it.

I read that you also did sociology. That’s very interesting. Was that before or after psychology?

It happened along with social psychology. I didn’t want to take a psychological approach only but also look at the same topics from a sociological perspective. Part of my motivation was a faculty member at Utrecht University with a very strong background in mathematical sociology. My doctoral specialization was in game-theoretic approaches to coalition formation, again something with the flavor of decision theory.
As part of it, I also had to delve into the theory of linear programming, something that turned out to be quite beneficial later in my career. I completed my degrees after finishing my Ph.D. in psychometrics. I’ve never done anything with it, except that I still have an interest in game theory.

Has it brought you anything other psychometricians don’t have?

Game theory is a fascinating area and it really helps one understand social situations. Every time I open a newspaper or watch a news program, I’m reminded of the relevance of game theory. The same goes for the more substantive parts of psychology and sociology. I don’t pretend I’ve developed into a full-blown psychologist or sociologist, but their concepts and notions, which help you understand people around you and how they behave, are very valuable in life.
What are the three most important topics you worked on during your career?

When I came to Twente, I already had an interest in decision theory and optimization. I also had discovered item response theory (IRT). And I was very fortunate to develop an early interest in Bayesian statistics, mainly because Mel Novick taught a course on his book² with Paul Jackson about it in the Netherlands in the early 1970s. The three pillars of my academic life have always been IRT modeling, Bayesian statistics for the statistical treatment of its models, and optimization theory to apply them in producing practical solutions to real-world problems. As for these real-world problems, educational measurement has been my permanent favorite, mainly as a result of the impact that Amsterdam, or rather Don Mellenbergh, has had on me. In fact, in all of my academic life, I’ve followed a very simple principle: Anticipate where the field of educational measurement is heading, find out what its next generation of problems is going to be, and work hard to be among the first to address them. In doing so, I’d focus primarily on the technical aspects of the problems. For instance, in the 1980s, it became clear that computers were going to revolutionize everything, so I anticipated a lot of testing was going to be automated. One of my first interests was automated test assembly, leaving test assembly to the computer, which basically requires an application of optimal statistical design. When testing turned adaptive in the 1990s, the optimization had to be applied in an adaptive mode. I also immediately realized that the use of computers would make the measurement of response times at item level possible. So I’ve also worked on the problems of modeling and applying response-time information to practical testing problems. My latest wave of interest is in IRT-based forensic analysis.
It was clear to me that with high-stakes testing becoming massive and more anonymous, the danger of cheating and item security breaches was imminent. IRT and response-time modeling help us to detect such cases.

2 Novick, M. R., & Jackson, P. H. (1974). Statistical methods for educational and psychological research. New York: McGraw-Hill.

Did you manage to be the first one to work on those problems?

Probably in the sense of taking a step back, conceptualizing the new problem area, and delving into mathematics and statistics to find the right tools for solving its problems. Quite a few of my colleagues tend to jump at single problems, trying to solve them in isolation. Their solutions typically work well locally but lose power when the problem specifications change or have to be integrated with solutions to other problems at system level. A good example is automated test assembly. Most of my colleagues have developed their own heuristics, which work well for specific problems but require extensive modifications and evaluation when applied to different cases. To me, it has always been clear that, formally, test assembly problems belong to the class of mixed-integer linear optimization problems and we should resort to its algorithms with proven optimality to solve them. My eye opener was a seminal paper by Phiel Theunissen at CITO in the 1980s. The algorithms also offer us the flexibility to deal with any set of test specifications we may meet in practice. I’m very happy to see a lot of this type of automated test assembly now being applied everywhere. It’s very valuable to have the option to just push the button and have a computer assemble any set of test forms you may need in seconds. Adaptive testing is another example. Its introduction has led to research on lots of specific practical problems. But the same framework of integer programming can be implemented adaptively to solve any of them with proven optimality. It’s nice to have done most of the work already and see that it’s going to be applied.

So, when you went to Twente, you mentioned it was a new research group?

A new department, an educational technology department.

What distinguished the Twente group from other psychometrics groups?

The University of Twente started out as an institute of technology with a strong engineering tradition. When it added social and behavioral sciences, their programs were developed to have an engineering flavor as well. “Making things happen” and “being innovative” have always been high on the list of priorities at Twente, not in the form of the nitty-gritty applications but as bigger technological developments.

So that’s different from the groups in Amsterdam or Utrecht?

Amsterdam is more philosophical. It also has an interest in mathematical psychology. Once in a while I choose a more fundamental topic; I love doing that as well. But my main drive has been engineering educational testing.

So, you’ve now worked for a US testing and software company rather than a Dutch university. Is the work fundamentally different?

Not much. These companies do have good research departments. I haven’t been able to teach classes since I started working for them.
But I’ve still worked with students at Twente on their dissertations until I had to retire 2 years ago because of the mandatory retirement age in the Netherlands. Also, these companies offer the opportunity to work directly on a variety of applications. You can learn a lot from applications; they’re very stimulating. I probably wouldn’t have made test security and forensic analysis research topics until I saw how relevant they are. Protecting your tests and fighting against cheaters is daily practice in the testing industry.

And it’s what people need.

Definitely. It’s a very relevant topic. There also are deep statistical aspects associated with it. You’re not going to blame anyone for cheating without much evidence; you need powerful statistical support to make claims.
So, what are you currently working on then, if you’re not traveling?

Still the same topics. The company I currently work for is changing its course; it used to be a smaller testing company but always had a strong emphasis on technology. It now is in the process of reinventing itself as a software company developing solutions for the testing industry. It has quite a few software engineers employed. I love working with them. We are basically doing the research, developing the models and algorithms, and they are writing the software.

Are you working on a specific program now?

The main project is implementation of adaptive testing. There are still a few open ends there. For instance, most of it has been developed for dichotomous items. Right now, my team is working on generalizations to polytomous items. I’m also planning to work on adaptive testing with open questions that need machine scoring, so there’s another interesting challenge. In fact, all of testing is going to change. It’ll no longer be one big test administration, with huge data files and one-shot analyses. What we will see is continuous testing with streams of data controlled by parameters that are updated in real time. The principle of adaptation enables us to plan and optimize these processes one item administration ahead. Eventually, we will even see computer algorithms generating and field testing items on the fly.

So adaptive is going to take over?

Sure of it.

Is it possible to apply it in all different kinds of education problems?

Absolutely. And not just in educational testing but in psychological testing too. And there are very interesting applications in medical diagnosis now, especially the part of it that deals with patient reported outcomes. Medical doctors have used long questionnaires to diagnose patients or evaluate therapies.
The alternative is to give them a handheld device in the waiting room or send them home with an app on their mobile phone that shortens the procedure tremendously by going adaptive, zooming in on the right questions and stopping when enough precision has been reached. The app allows medical doctors to monitor their patients in real time.

Looking back at your career so far, what do you think is your biggest contribution to psychometrics?

The same things we just talked about. Making automated test assembly a workable technology is probably at the top. But also my contributions to adaptive testing. And I was one of the first who realized how to parameterize response time models to make them usable for application in educational and psychological testing.
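The idea of treating test assembly as a 0-1 optimization problem, mentioned earlier in this interview, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the interview: the item pool is hypothetical, the objective is Fisher information under a two-parameter logistic IRT model at a single ability value, and the search is plain exhaustive enumeration standing in for a real integer programming solver.

```python
from itertools import combinations
from math import exp


def item_information(a, b, theta):
    """Fisher information of a 2PL item (discrimination a, difficulty b) at ability theta."""
    p = 1.0 / (1.0 + exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def assemble_test(pool, theta, length, min_per_content):
    """Choose `length` items maximizing total information at theta,
    subject to a minimum number of items per content area.
    Exhaustive enumeration of all 0-1 selections; only feasible for tiny pools."""
    best_value, best_form = float("-inf"), None
    for form in combinations(range(len(pool)), length):
        # Count the selected items per content area.
        counts = {}
        for i in form:
            area = pool[i]["content"]
            counts[area] = counts.get(area, 0) + 1
        # Skip selections that violate a content constraint.
        if any(counts.get(area, 0) < k for area, k in min_per_content.items()):
            continue
        value = sum(item_information(pool[i]["a"], pool[i]["b"], theta) for i in form)
        if value > best_value:
            best_value, best_form = value, form
    return best_form, best_value


# Hypothetical six-item pool with two content areas.
pool = [
    {"a": 1.2, "b": -1.0, "content": "algebra"},
    {"a": 0.8, "b": 0.0, "content": "algebra"},
    {"a": 1.5, "b": 0.2, "content": "geometry"},
    {"a": 1.0, "b": 1.0, "content": "geometry"},
    {"a": 1.8, "b": -0.1, "content": "algebra"},
    {"a": 0.6, "b": 0.5, "content": "geometry"},
]

form, info = assemble_test(
    pool, theta=0.0, length=3, min_per_content={"algebra": 1, "geometry": 2}
)
print(form, round(info, 3))
```

Enumeration guarantees the optimum here only because the pool is tiny. For realistic pools the same objective and constraints would presumably be handed to a mixed-integer solver, which is the route the interview describes.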
Did you ever receive strong criticism? Have you ever done controversial work?

Not in the sense that I have personal enemies! But, of course, each field has its traditions, and traditionalists don’t always agree with what you do. I’m a stubborn IRT person. And for very good reasons. Its basic idea is to model your observations, parameterizing everything that has an impact on them. If it’s testing, you observe responses to test items and need parameters to account for the effects of the abilities that are tested, the properties of the items, and if there are other conditions you study, you parameterize their effects as well. The presence of these parameters allows you to ascribe effects exclusively to known sources, that is, to avoid the methodological pitfall of confounding of effects. To me, avoiding conclusions that are confounded is a fundamental requirement of any healthy science. The same approach works for the observation of any kind of behavioral data. Of course, you have to estimate your parameters and take checking the empirical fit of your model seriously. So, much of the approach takes you directly to the heart of statistics, and in this sense it’s complicated. Some people think it’s nonsense. They want “to stay close to the data” and believe the necessary mathematics just obscures what you’re trying to observe. It is hard to convince them. But personally I’d rather deal with the mathematics than sell conclusions that are fatally confounded. That’s part of science!

After speaking to so many psychometricians, I have the feeling that some of them think that psychologists don’t pay enough attention to things like IRT, even though it could be very useful to them. Is that your experience as well?

Absolutely. What I’m now going to say is oversimplified. But when I was in school, I learned that there exists a huge difference between what was called general and differential psychology.
General psychologists were experimental and supposed to have an interest in average effects, whereas differential psychologists analyzed the remaining variance, introducing variables to try to explain the differences between people. I think that’s a hopelessly unfortunate distinction. I’m not aware of much progress in differential psychology. It might be an interesting area, the research might be stimulating and creative, but I have my doubts. I was trained as an experimental social psychologist. But I have my doubts there, too. You run your cleverly designed experiment, create a few different conditions, observe the effects, and, good heavens, when they don’t agree with someone else’s experiment, you reflect on this and create a new experiment. In the long run, the only thing you end up with is a catalogue of averages. And averages don’t tell you much about your individual observations. What is missing in both traditions is models explaining effects at the level of individual subjects. Let me give an example of response times. The typical experiment in psychological response-time studies is a single task replicated by a number of subjects with observations both of the time to complete the task and whether it was a success or a failure. Its main result is a scatterplot with the proportions of success against response times across all subjects. The mathematical aspect is to fit a meaningful curve through it. However, in doing so, all subjects are treated as exchangeable, which they definitely are not; some work generally faster than others. And when you move to another task, you’re bound to find other effects, without any way to relate results across experiments. I came up with an approach along the lines of IRT modeling. What you need to model are the individual response times on test items with parameters both for the effects of the individual items and the subjects. The difference from the models in mathematical psychology is that these models treat the subjects as working at different speeds and the items as formulating tasks that require different amounts of time. The presence of these separate parameters allows you to predict differences in response times both between subjects and test items. The beauty of it is the applications that immediately make sense: setting time limits on tests, using response times to improve item selection, the development of time management tools for testers, detecting items that have been compromised, etc. I’ve written quite a few papers on applications of this type of model. A model-based approach, at the level of the individual subjects, with full parameterization of everything that has an effect is definitely something psychology should learn from psychometrics.

If I understood you correctly, you have suspicions about the experimental tradition, but also about the individual differences tradition.

Yes, it’s still hard to find general laws in these two traditions. I might be old-fashioned, but the goal of any science is to discover laws, how things work together. There are not many laws in psychology yet. We have a very rough time finding them.

But you believe they exist?

If you want applications, you need laws. Physics became successful when it began producing its laws. And these laws enabled us to engineer solutions to practical problems. The same holds for any mature science.

Do you envy those fields? Do you envy physics? I sometimes encounter people who are jealous of a field that is able to formulate those laws and prove them.

I’m not jealous, not at all. Actually, a while ago, I turned back to reading some physics and some of its problems are almost exactly the same as in psychology. Georg Rasch wrote a very interesting book,³ especially his first chapter and an appendix at the end of the book which, I think, should be required reading for every psychologist. In the appendix, he shows how one of the fundamental laws in physics (force = mass × acceleration) is completely analogous to the parameter structure of the Rasch model, complete with the same status of its variables as observed and latent. We should not envy physics but enjoy our common mission.

3 Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.
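The analogy with Newton’s second law can be made concrete by writing both relations side by side. The notation below is an assumption chosen for illustration and is not taken from Rasch’s appendix: $a_{vi}$ is the observed acceleration of body $v$ under force $i$, $P_{vi}$ is the probability that person $v$ answers item $i$ correctly, $\xi_v$ is the person’s ability, and $\delta_i$ is the item’s difficulty.

```latex
% Newton's second law, solved for the observable quantity:
% acceleration is observed; force F_i and mass m_v are not observed directly.
a_{vi} = \frac{F_i}{m_v}

% Rasch model in multiplicative form: the odds of a correct response are a
% ratio of a person parameter and an item parameter, both latent.
\frac{P_{vi}}{1 - P_{vi}} = \frac{\xi_v}{\delta_i},
\qquad
P_{vi} = \frac{\xi_v}{\xi_v + \delta_i}

% With \theta_v = \ln \xi_v and b_i = \ln \delta_i this is the familiar logistic form.
P_{vi} = \frac{\exp(\theta_v - b_i)}{1 + \exp(\theta_v - b_i)}
```

In both cases a quantity tied to observation is expressed as the ratio of two parameters that are never observed directly, and each parameter is identified only up to a common multiplicative constant. This is one way to read the claim that the variables have the same status as observed and latent.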
So, you studied psychology and sociology. Do you consider yourself a psychologist?

Not professionally. I occasionally read psychological literature to find out where its progress has been. Lately, I’ve been reading quite a bit about research on the speed-accuracy tradeoff, one of the serious laws in psychology. As a psychometrician my interpretation of it is different though, as a within-person law rather than the between-person phenomenon recorded in psychological experiments. But, again, I’m no longer a trained psychologist.

So, what do you think is the relation between psychology and psychometrics?

Ideally, the two should never have grown apart. But given the state of affairs, I think it’s good for them to be relatively separated. We don’t need much psychology to do psychometrics. But psychologists need psychometrics to become successful at psychology. At a more practical level, psychometricians need a strong training in statistics and computer science to advance their field. You don’t necessarily need the same level of training to be successful in psychology.

So, why do we need psychologists?

To detect the laws that govern human behavior, which is an extremely relevant mission.

But not for your work.

No, not for my work. That’s probably because I’m specializing in educational measurement. It would be different if you specialize in psychological testing. Then you’d definitely need to be up to date with a substantive area.

So educational measurement could do without a substantive area?

For all practical purposes, yes. With one exception: If you want to explain what cognitive processes happen during testing as, for instance, in diagnostic testing for instructional purposes, then not. But you don’t need to know precisely what happens cognitively during testing when the goal is to make college admission decisions, evaluate educational progress, or offer vocational guidance.

So, what is the current role of psychometrics in society?

Twofold.
Psychometrics could be a major support to psychology. It could make its measurement more rigorous, help to plan its experiments better, and model its empirical data. On the other hand, there is an ever-expanding educational testing industry, right now probably more so in the United States than in the Netherlands. Other areas are following. I’ve already mentioned medical diagnosis. Another emerging area is marketing research. And now that economics has become more behavioral, it is bound to follow. These enterprises need psychometricians to perform the daily operations.
Is that also psychometrics’ biggest achievement? That it has become such an industry?

Yes, it does have applications with far-reaching personal and social consequences. You can be a different type of psychometrician of course, one focusing on statistical questions only. We badly need their expertise as well. But I happen to be in a part of psychometrics that has been driven by applications.

Do you think the practical is more important than the theoretical?

No, no, no! There’s no distinction there! Dangerous territory here. Now you get me going. I know a lot of people who pretend not to be interested in theory because they want to be practical. But it’s actually the opposite. If you really want to be practical, you have to go deep theoretically to know what you’re doing and guarantee success. You should always be able to defend your solutions. Also, I do like applications because they’re technically much more challenging than purely theoretical questions. You need much of statistics and mathematics to solve practical testing problems.

Are those also the fields that you mostly use?

Yes, in a sense, psychometrics is applied statistics. The difference with other fields of applied statistics only lies in the models it uses. But how to treat your models is what you need statistics for.

Who has inspired you most in your career? It can be someone you actually knew, an article or book that really fascinated you.

I’ve already told you what happened with my degree. I did my first class in test theory, the last class as part of my degree. The next thing that happened (I’ve always been very fortunate) was a small reading group organized by Don Mellenbergh. We started reading Lord and Novick,⁴ which had just been published. Most of it consists of classical test theory but, again, presented in an axiomatic way. Its last part, written by Allan Birnbaum, was my first introduction to IRT.
Shortly after this, I was able to spend a summer with Fred Lord at Educational Testing Service, who was working on a new book on item response theory5 and used the manuscript to teach an extended course. A little later, as I’ve already mentioned, Mel Novick visited the Netherlands, teaching from his new book with Paul Jackson, an introduction to Bayesian statistics. Quite fascinating to learn everything firsthand.

4 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
5 Lord, F. (1980). Applications of item response theory to practical testing problems. New York: Routledge.
I heard about this, from Ivo Molenaar. Yes, there was a whole series of these events in the Netherlands, organized by Ivo Molenaar at Groningen. With Gerhard Fischer, Darrell Bock, Paul Holland, all the big stars teaching new developments. Those are all the people I’d love to talk to but that has become slightly difficult! Some of them have retired and are no longer active. Others have sadly passed away. But they were incredibly instrumental for psychometrics? I think so. Especially for the more conceptual part of psychometrics, the idea that you need to model and how you should do that. For technical sophistication, we had to wait until the 90s. At the moment, we profit tremendously from newer developments in statistics, especially Bayesian statistics, computational methods, and computer science. There were hardly any computers around in the 1960s and 1970s. It’s still sometimes hard to imagine how people did psychometric work in those days; it has become so intertwined with computation. It must have taken them weeks or months to calculate something. That’s completely different now. As a consequence, they didn’t publish much, not as much as we are able to do now. But they helped establish a tradition in which you are required to be productive. Do you find parts of their work in yours? Yes! I consider myself the next generation. I’m standing on their shoulders! Okay, so who is part of that next generation? The first generation were Lord and Novick? When I got older, I became more interested in history. If you go back, it all started with Alfred Binet and his intelligence test. Binet has not received the credit from psychologists he deserves. He has been generally hailed as the inventor of the standardized test but that’s wrong. The whole idea of standardization is due to nineteenth-century German psychologists and had already been around for quite a while. Binet’s real contribution is that he was the first to scale his test items.
Nobody had ever seen the necessity to do so before. His scaling consisted of a field study where he tried out his items, fitted response functions to his response data, and used these functions to derive the scale values he needed to score his subjects. The one who picked it up, and who absolutely is my second favorite, was Thurstone. Binet’s response functions were empirical, but Thurstone noticed that their shape was close to that of normal distribution functions and started modeling them this way. He still struggled conceptually with a few issues but made quite some progress. In the 60s, there were Lord and Novick, Georg Rasch, and that’s when IRT really took off.
Thurstone also started the Psychometric Society, and set up a journal, so he was really at the foundation of psychometrics as an established known science. But Georg Rasch was always a little bit on the outside, right? Georg Rasch was a bit of an outsider. He was a mathematician-turned-statistician, with an interest in intelligence measurement because of his relationship with the Danish Institute for Educational Research. He was fully aware of a new tradition in statistics, now known as conditional inference, which helped him to treat his own models. Conceptually, though not a psychologist, he was a star. I really like the first chapter of his book. Earlier on in this interview, we talked about psychological experiments that erroneously treat subjects as exchangeable. In those first 15 pages, he deeply analyzes psychology and comes up with an alternative: the individual-level, model-based approach we discussed earlier. And then, in his appendix, he shows the analogy between his approach and how physics developed one of its fundamental laws. I find the keenness of his analyses amazing. I still reread these pages once in a while. Are there things that psychometrics has not achieved yet which you consider important? What is its biggest challenge? Selling itself. I think we discovered that very late. I blame myself as well. I would have had a much bigger impact if I had developed user software. I’ve not really been in the business of developing software that everyone else could use. Frankly, I’ve used the computer only to run examples I need for my own papers. It’s nice to have articles and books with the results from your research on your shelves. But if you really want your stuff applied, you should deliver user-friendly software. Potential users want software to run our models, and there just hasn’t been much of it. Would that also be a way to the psychologist’s heart? Yes, we would have had much more impact on psychology if we had produced user-friendly software.
It’s not only a problem in psychometrics; statisticians have long had the same problem. They struggle now with data science as a competitor, which basically consists of computer scientists reinventing statistics. The same might happen to psychometrics. We have all underestimated the impact of computers, how powerful they are and the need for good algorithms and software. Most psychometric programs still underestimate the computational part. I’m not aware of any that offers mandatory classes on computer science. It should become our second nature to write user-friendly software. The presence of R certainly helps. There are more and more psychometric software packages available in R. But not many psychologists are familiar with this language, so we still have a problem. In the 1990s, we had an organization called ProGamma at the University of Groningen. Have you ever heard about it? I don’t know exactly who started it. But it was based on the fantastic idea of taking the development of user software out of the hands of researchers. ProGamma was subsidized by the Ministry of Education or the Dutch Science Foundation NWO; I don’t remember these details. During annual
calls for proposals, you could submit your research software which, when accepted, would be developed into user software, complete with documentation and a help desk function. Then someone pulled the plug and that fantastic service was gone. ProGamma was housed at the University of Groningen; ask around there and they’ll tell you the story. So, part of the job description of a psychometrician is to teach those methods? Yes, this is how we should train the next generation. So, one of the challenges within psychometrics that we need to solve is to find a solution for the fact that we’re not communicating well. But is there a scientific problem that psychometricians need to solve, is there still a gap in our knowledge that we should address? We are leaving the era where educational testing consisted of periodic massive test administrations, each with a developmental cycle consisting of item writing, field testing, item calibration, test assembly, administration, data cleaning, forensic analyses and, finally, examinee scoring: basically the process introduced by Alfred Binet more than a century ago. Rather than these cycles, the presence of powerful computers, the Internet, and cloud storage will replace each of these steps by parallel real-time processes with permanent parameter updating and optimization. To support this development, we need to rethink all of current psychometrics. And is psychometrics going to play a role outside the testing area? Is psychometrics going to play a role in other fields? Perhaps in neuropsychology? I think so. It’s amazing how international the Psychometric Society has become. There used to be American meetings only. Then more and more European researchers got involved, and in the last 10 years I’ve been happy to see attendees at our annual meetings from every continent and with a large variety of backgrounds. Biostatisticians have developed an interest, thanks to the medical applications I’ve mentioned earlier.
The same process has happened to our journal, Psychometrika. As a final question, what are your own plans for the future? Remain active as long as possible. I’m no longer at the beginning of my career, so I’m using my time as cleverly as possible to make the most of it. There are still quite a few open questions, things I’d like to finish. There are two more books I want to write, one on response times and a textbook on item response theory. The latter will be more conceptual than usual but at the same time technically advanced, completely different from other IRT books.
Do you reckon that you might be a psychometrician the rest of your life? There will come a point when I’ll be no longer able to. I don’t know what’s going to happen to my mental abilities. But right now, I’m still pretty sharp and have a lot of energy. So, I’ll keep going! With that I think we’ve come to the end of the interview, so I want to thank you. My pleasure, thank you!
Chapter 12
David Thissen
“I’m exclusively a psychometrician, I was exclusively a psychometrician before I left graduate school, and I’ve never pretended to be anything else.” David Thissen is professor emeritus of quantitative psychology at the University of North Carolina at Chapel Hill and was president of the Psychometric Society in 2000. Thissen wrote his dissertation at the University of Chicago in 1976 under R. Darrell Bock’s supervision. His research interests include item response theory and differential item functioning.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_12
Welcome David! Thank you for your participation in this project. This project is an oral history project on the history of psychometrics, and in this interview I will be asking questions concerning three different themes: first, your career as a psychometrician; second, the relationship between psychology and psychometrics; and third, your view on the history and the future of psychometrics. Just the first question: how did you end up in psychometrics? Well, that is actually a sequence of stories. I sometimes think I should tell the story now because I’m increasingly noticing that students seem to think that they can plan their lives, yet I really didn’t have such a plan. The beginning of my winding up in this business involved the first statistics course in psychology in college, which we were all required to take. It was taught by someone who was using, as it was called then, a programmed learning course. It involved a great deal of tedious arithmetic, all done by hand on paper of course. This was before we could use any computers! Back in the day! I really didn’t like to do all of that arithmetic. So I didn’t get a very good grade: a B instead of what you’re supposed to get. And I was unhappy about that. I decided that I would demonstrate to myself and everybody else that I didn’t have any problem doing the statistics part of psychology, so I decided I’d take more statistics courses. Now, the only other statistics course I could take was the first-year graduate sequence, which was officially for graduate students, but undergraduates could take it and I was the only undergraduate who did. I went ahead and took that course, and I had no trouble with it because it was actually serious, or realistic. I also took a course in using computers just because I was interested. And I didn’t really think about these courses in a career sense; they were just things I wanted to do.
When I applied to graduate school, I intended to become a developmental psycholinguist, because I was interested in the development of language. I had worked with cognitive psychologists while I was an undergraduate, and at the University of Chicago where I hoped to go, there was the best psycholinguist in the world. But of course, at the time, application for graduate school happened by sending your application by paper mail and requesting catalogues. The universities would send you very thick paper documents and then you’d send your application back. There was no Internet; you couldn’t look anything up. Good old times. In my wisdom at the time, it turned out I had applied to the wrong department. I thought the development of language would be in the group called human development, because well, shouldn’t it? The professor (David McNeill) I thought I wanted to work with was in the psychology department, which was separate, so he never really knew I applied. Human development was an interdisciplinary program, and their methodologist—the name they put on their grant applications so they could say they had a methodologist working in their department—was a professor named Darrell Bock. The people who would normally apply to human development were
usually not very interested in working with Darrell, or Mr. Bock as he was referred to when we were in Chicago as students. So they saw my application and all the statistics courses I had taken, and they assigned me to him. I knew when I applied for graduate school that I wanted to go to Chicago, because it was a place of some distinction. It turned out to be worthwhile. I really didn’t know who Mr. Bock was (we met once at my interview, when he asked me questions I couldn’t answer), but they accepted me. I went there still with the intention of becoming a psycholinguist. And Darrell’s method of advising students was extremely un-controlling. His students could basically do anything we wanted, and we had the privilege that we were able to get his attention. We would tell his secretary that we wanted to talk to him and then we could. No one else could get in. We were special! There were only a small number of students he trained and we were taken care of very well. One of the ways he trained us was to let us do whatever we wanted, so I worked in the developmental psycholinguistics lab for about a year, until I figured out it wasn’t really for me after all. The field was strongly dominated by formal linguistics, and I just did not have the head for that. Meanwhile, I was taking Darrell’s classes in statistics, and by the second year I was also working on research projects with both Darrell and Howard Wainer, who was working in Darrell’s research group at the time. I noticed that doing the statistical analyses was relatively easy for me. It appeared to be a great deal easier for me than for almost everybody else, except for the other quantitative students, and I lost interest in psycholinguistics. By the second year, I was mostly doing data analysis. I had dropped the psycholinguistics, and I took the test theory course. I had no idea how remarkable that was at that time.
The course was jointly taught by Darrell Bock and Ben Wright, who was at that time the leader of the Rasch model movement in the United States. Darrell and Ben did not want to spend a lot of time talking to each other, so they alternated in giving lectures. Looking back on it, I now realize that for the time it was the most remarkable test theory course anybody ever had, because it went entirely beyond the state of the art. So I became interested in test theory. I wrote a paper for the course, and then I, being the foolish graduate student I was, wound up submitting a bad term paper. But I sent the paper1 to a journal and someone printed it. It wasn’t that good, but I was doing things that other people couldn’t do because we were doing state-of-the-art work. And I decided that test theory was going to be easy for me. Easier than psycholinguistics. Certainly easier than the psycholinguistics. I could see things people weren’t doing that maybe I could do, so I moved into test theory as my path of least resistance. Anyway, it wasn’t planned; it all had to do with that first statistics class, which I subsequently taught very differently than he did, but that was that.

1 Thissen, D. M. (1976). Information in wrong responses to the Raven Progressive Matrices. Journal of Educational Measurement, 13, 201–214.
What appealed to you about psycholinguistics in the first place? I was and am very interested in the magical processes by which kids learn to talk. My granddaughter is now between 1 and 2, so she is rapidly going from 1 word, to 2 words, to full sentences. The fact that humans can learn to talk is really amazing. So what about it wasn’t amazing when you worked at the psycholinguistics lab? The process remained amazing. The research methods were dominated by Chomsky and linguistic analysis, which involved very formal descriptions of the language. And it’s kind of odd. I do alright with a limited range of mathematics, which is a set of formal descriptions, but the logical symbolic descriptions of language that were Chomskyan linguistics…that I struggled to wrap my head around. It’s probably for the same reason I don’t do well with languages that are not English! I never succeeded in learning any other languages, to my great regret. But it’s just a weakness in what I can do. So I did what was easiest for me. But there must have also been something that sparked your interest about psychometrics? It was mostly just that we could do it, but I also had an interest in it of course. My first topic in test theory involved parts of item response theory that deal with questions that have more answers than just yes or no, or right or wrong. There are a lot of people who develop test theory for right and wrong questions in education, and I could see the world didn’t need any more test theorists who could do that. But one of the things Darrell was working on that I borrowed from him was the nominal model for polytomous responses. I’m not sure I like that word, it’s an odd hybrid of Greek and English, but you can also call them multiple category responses. Examples are Likert-type responses, which have 5 answer options ranging from strongly disagree to strongly agree, or nominal responses, where you choose one of four or five answers.
That first paper was about this, specifically about the Raven’s progressive matrices (a test of pattern matching) as a measure of cognitive ability. The questions in these matrices have six or eight alternatives that you can pick to fill in a pattern. The paper was about the structure of the incorrect responses, the distractors, and I used a model to see if people who were more able, but not quite able enough to get the right answer, chose specific answers, so that some answers were less wrong than other answers. We have a model, which would give people partial credit. The multiple category models were no more difficult than the right/wrong or yes/no models, but almost nobody was doing it, and so, I could make it my own. I wasn’t doing much inventing. Darrell Bock had invented the nominal categories model2 and Fumiko Samejima had invented the graded model.3 Fumiko Samejima actually
2 Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
3 Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17, 34, Part 2, 1–100.
invented a lot of that work. Her descriptions were considered very difficult to read by almost everybody, including me. And so I didn’t have to invent things but I tried to work out what Fumiko was trying to say, and then say it a little bit more simply. Going from there, I developed a number of interests. What other projects did you do? After I got my first job, I did projects in developmental psychology or cognitive psychology, like studies of interference in the Stroop task. I did the statistics for a study on what children learned from watching television. What came out of that? A great deal! It was a unique set of studies organized by Aletha Huston on what she called prosocial behavior. It wasn’t about television that will rot your mind; they were studying the prosocial effects of Sesame Street. The data were ratings of the programs in terms of how prosocial they were, and they would observe kids who would watch Sesame Street. It was an elaborate study. All I did was help with the data analysis, though there was a fair amount of that. Did you do a lot of data analysis for other researchers? Yes. After I finished graduate school in Chicago, my first job was at the University of Kansas, and there were only two of us who were quantitative psychologists. Susan Embretson was the other one. It was just the two of us for the entire psychology department. We taught the quantitative course and collaborated with others on the faculty. When you finished your PhD, what was the next step? Again, I took the path of least resistance. I guess I was trained to be a professor, so I decided to try and do that. This is another story of serendipity. There weren’t that many positions to apply for. The only job that I was offered was at the University of Kansas, and technically, it was a 1-year replacement, not a permanent position. For reasons that will become clear, they sort of knew it would be permanent. So I went through the procedures for hiring permanent staff.
There were two quantitative psychologists there. The one before me was Julie Shaffer, but she had gone on sabbatical for a year at Berkeley. She was working with Erich Lehmann, who was an extremely well-known mathematical statistician. She and Erich became close, and toward the end of her sabbatical year, she decided she would rather like to stay in Berkeley and marry Erich Lehmann than return to Kansas. So she had gone to Berkeley and did not return, and I was hired to substitute for a year, but everyone knew she would probably not return. She remained in Berkeley, and after my first year there, I was sort of rolled on to the faculty. It had actually been the plan all along. At the following Psychometric Society meeting, both Julie and Erich attended, and the first time I met Erich, I thanked him for arranging my job, which he took very graciously.
Did you start a new line of research at the University of Kansas? The collaborations continued; I did some programming. I turned my interest in multiple category models into writing a computer program to fit item response theory models.4 The path of least resistance is definitely a theme here. Kansas is a state university, and state universities in the United States have procedures for getting equipment where you have to find the lowest bid for whatever you want to buy, so, due to that, I worked at the university where the copiers were not Xerox, the elevators were not Otis, and the computer was made by a heating and air-conditioning company. We used Honeywell computers and that was certainly different from the IBM top-of-the-line machinery that I had been using at the University of Chicago. And you couldn’t easily move a program from one system to the other, because the IBM in Chicago was like a PC and the Honeywell like a Mac (to make a contemporary analogy). I had the Fortran source for item response theory for IBM computers, but it was going to be a great deal of work to convert those IBM programs to work on this Honeywell. Over Christmas vacation of my first year there, I figured that I didn’t actually like the way these computer programs worked, and if I was going to do that much work, I might as well write my own program and make some changes. As I told you, my interest was in multiple category models, and I thought someone might want to do a test or questionnaire in which they ask some questions with dichotomous answering options, and some other questions that were graded or scored in multiple categories. The fact is when I was doing this in 1976, 40 years ago, the theory for doing that was obvious (though no one had ever said anything about it), but computer programs couldn’t do that yet. There was a computer program for dichotomous answers, a computer program for Likert-type answers, and a computer program for nominal kinds of answers.
But no computer program for combinations. Actually, it was odd. When I told my then former advisor, Darrell Bock, I was going to do this, he said “why would anybody want that?!” It turned out, actually, nobody did, for about 15 years after I wrote the program. Educational psychology can be a bit faddish. Later, it became a fad to have what they called “authentic assessment” or “performance assessment,” which involved long answers to questions that were rated by judges. And from the point of view of test theory, this becomes like the Likert-type scale: you’d either get zero, two, or three points. And then they also wanted to include multiple choice questions. So actually, 15 years after I wrote the computer program that could roll them together, someone actually wanted to roll them together. But anyway, when you write a computer program, you test it, you see what it can do, and you write papers, because the computer program itself doesn’t contribute much to career advancement in academics. But we could find things to do with it.
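The remark that the theory for combining item types was obvious can be made concrete with a small sketch. Under the usual local-independence assumption, the likelihood of a response pattern is a product of per-item category probabilities, whatever model each item follows. The Python below is only an illustration of that idea, not a reconstruction of Multilog: the function names, the item dictionaries, and all parameter values are hypothetical, and a two-parameter logistic item and a Samejima-style graded item are assumed as the two item types.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a positive response to a two-parameter logistic item."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_graded(theta, a, thresholds, k):
    """Probability of category k (0-based) for a graded item with ordered thresholds."""
    # Cumulative curves P(X >= j): 1 at the bottom, logistic in between, 0 past the top.
    cum = [1.0] + [p_2pl(theta, a, b) for b in thresholds] + [0.0]
    return cum[k] - cum[k + 1]

def log_likelihood(theta, items, responses):
    """Sum of log category probabilities across items of either kind."""
    total = 0.0
    for item, x in zip(items, responses):
        if item["kind"] == "dichotomous":
            p = p_2pl(theta, item["a"], item["b"])
            prob = p if x == 1 else 1.0 - p
        else:  # "graded"
            prob = p_graded(theta, item["a"], item["thresholds"], x)
        total += math.log(prob)
    return total

# Hypothetical mixed-format test: two right/wrong items and one four-category item.
items = [
    {"kind": "dichotomous", "a": 1.2, "b": -0.5},
    {"kind": "dichotomous", "a": 0.8, "b": 0.7},
    {"kind": "graded", "a": 1.5, "thresholds": [-1.0, 0.0, 1.2]},
]
responses = [1, 0, 2]

# Crude grid search for the ability value that maximizes the likelihood.
grid = [g / 10 for g in range(-40, 41)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, items, responses))
print(theta_hat)
```

The point of the sketch is that nothing in `log_likelihood` cares which kind of item contributed a given term, which is why a single program can handle the combination once each item type supplies its own category probabilities.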
4 Thissen, D. (1986). MULTILOG [Computer program]. Mooresville, IN: Scientific Software.
If you look back, what was the most influential moment of your career? Was there maybe a paper you’ve written that you think made an important contribution, or maybe this program? That would depend on the audience. Multilog was certainly my entree to other things. Multilog is why people would call me and ask me to do some analysis or collaborate with them; that’s why they had heard of me. So from my point of view, the most influential project was the computer program. But I mean, anybody who wants to work hard can write a computer program. Intellectually speaking, it would be a paper in Psychometrika Lynn Steinberg and I did in the middle of the 1980s.5 The paper was about different ways item response models could be used. For example, all response options could be ordered, or one of them was by itself and the rest of them were in an order, or two responses were in one order and two other responses in a second order. Anyway, it was something others were sort of missing. It’s not really invention, but rather discovery; it’s about seeing what’s already there. That paper is probably one of the more ingenious things we managed. It’s what I’m proud of. If you ask other people, or if you look at citations, it would probably be the 1993 chapter in Howard Wainer and Paul Holland’s book Differential Item Functioning.6 That chapter set a standard for how DIF was done with standard statistical tests. The presentation I’ll do tomorrow morning is actually a spinoff from that chapter. The curious thing is that the DIF work was not even a path of least resistance. It was just already there by accident. We had written Multilog, and it could do analyses of performance of test questions of two groups, or more than two if you wanted. Then other people in the 1980s started to develop ideas about differential item functioning, which originated by trying to find items that performed differently between groups, like boys and girls, or black and white students in American schools.
By performing differently in these different groups of people, the item would contribute to test bias against one group or another. We wanted to trace the items that contributed to test bias and remove them. It’s a social goal. I wasn’t very involved with this until I heard about it, and then I realized that we had already written the computer program, which could be used to do this. All we had to do was write the papers explaining how to use Multilog to do DIF analysis. That was not only the path of least resistance, but almost the path of negative resistance: all we needed to do was explain how to do the analysis. And that produced a whole lot of papers.7
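One common way to turn "does this item behave differently in two groups" into a standard statistical test is a likelihood-ratio comparison: fit the model once with the studied item's parameters constrained equal across groups and once with them free, then refer twice the difference in log-likelihoods to a chi-square distribution. The sketch below assumes that framing and is not taken from the chapter itself; the log-likelihood values are invented for illustration, the function names are hypothetical, and the closed-form tail probability used here holds only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    # For df = 2m: sf(x) = exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

def lr_dif_statistic(loglik_constrained, loglik_free):
    """Likelihood-ratio statistic: twice the gain in log-likelihood from freeing the item."""
    return 2.0 * (loglik_free - loglik_constrained)

# Hypothetical fitted log-likelihoods from two runs of an IRT program.
loglik_constrained = -5231.4  # studied item's parameters held equal across groups
loglik_free = -5226.1         # studied item's slope and location free in each group
df = 2                        # two parameters freed for a two-parameter item

g2 = lr_dif_statistic(loglik_constrained, loglik_free)
p_value = chi2_sf_even_df(g2, df)
print(f"G2 = {g2:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value would flag the item for closer inspection; whether it is then removed or revised is a substantive decision that the statistic alone does not make.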
5 Thissen, D. & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
6 Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 67–113). Hillsdale, NJ, US: Lawrence Erlbaum Associates, Inc.
7 Steinberg, L., & Thissen, D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11, 402–415.
This path of least resistance comes back a lot in your career. I do not plan! I was wondering, are you now passionate about psychometrics? Yes, oh yes, I’m exclusively a psychometrician. I was exclusively a psychometrician before I left graduate school, and I’ve never pretended to be anything else. Not a psychologist? I consider quantitative psychology part of psychology. Psychology has been Balkanized, cut into subfields, since before my time. Psychology in the United States was divided into subfields in the 1960s; researchers considerably older than me (who are now regrettably mostly dying) got their Ph.D.s in the 1950s when there was just one type of psychology doctorate degree. These people would know both quantitative psychology and experiments that involved rats in mazes. In the 1960s, the United States government started to give training grant funds to programs in universities, and they divided them by subject area. They divided psychology into a handful of kinds of psychology, and this produced separate fields: quantitative psychology, developmental psychology, social psychology, cognitive psychology, and biological psychology, which is now neuroscience. The psychology departments also divided their faculty into groups to seek these grants, and that became a substructure of psychology departments, which previously had not been divided. The subunits then gained a life of their own. So we all work at two levels: we are psychologists in the psychology department, and we’re all in our subdisciplines; some people are in more than one. It’s now almost inconceivable to get to the state of the art in more than one of these subareas. You can never know enough. Is psychometrics now one of those subfields? Yes. Certainly not every psychology department has quantitative psychology, but some do. There are enough departments to train students. So what is the ideal situation for psychometrics? It’s pretty much ideal where I am.
I moved from Kansas to North Carolina in 1990. At UNC I could work with graduate students because there was a program of quantitative psychology. In Kansas there were only the two of us, Susan Embretson and myself, and though she worked with a couple of graduate students, it was really challenging, as we didn’t have a program. North Carolina has had the psychometrics lab since Thurstone went there in 1952. It’s where I’ve been ever since. There are half a dozen of us in the quantitative psychology program, we train students in quantitative psychology, and there is a psychology department. I think what quantitative psychology is supposed to do is solve problems for psychologists. Just a tiny example is the source of the presentation I’ll give tomorrow morning. There was a student in clinical psychology at Chapel Hill who
translated a couple of clinical psychopathology scales into Chinese. I’ve always tried to avoid translations, I consider it impossible to get right, but people do it anyway. And you can use differential item functioning, DIF, to look at the performance of the items in the two languages. You can see if some of the items are ranked similarly by people in China and the United States. Some of the questions bounce around and become much easier to endorse, or much more difficult to endorse, and if that happens, it probably means that the translation doesn’t really carry the same meaning. You should either retranslate the question or not use it at all. This is now standard operating procedure for translations of tests. The catch was that the particular questionnaires the student wanted to use, used 7-point response scales, and as I said, I’ve worked a lot on models for multiple responses. The student ran into a lot of difficulty analyzing a large number of items with seven alternatives. If you try and analyze them as just seven discrete alternatives, there would simply be no one who chooses the seventh option for the third item, or the first option for the fifth item, because sprinkling 200 people across seven categories with most of them in the middle doesn’t leave much at the edges. So I told him that we were going to take this easy, and treat the 7-point scale as a continuous scale, and do factor analysis, which is item response theory by other people. We’ll use the linear models, and we’ll call 1 to 7 numbers, instead of categories, and then if no one picks option 7, we still have 1 to 6, it’s all fine. We have a long literature of factor invariance in factor analysis and differential item functioning in item response theory. They are the same thing, but in fine detail, the factor analysts do it differently; they were looking at whether they were measuring the same thing across a set of scales. Item analysts in test theory look for misbehaving questions. 
So item response models concentrate on the questions and the factor analysts on the overall model. When I tried to have the student read the factor analysis literature, it didn’t line up with what I wanted him to do with the data from the questionnaire. What I’m going to talk about tomorrow is how to take the factor analysis procedures and do with them what an item analyst would do. None of the factor analysts have ever quite written it down in an item-by-item way. You have actually written a paper on the history of psychometrics yourself. Half myself. Lyle Jones was a good deal of that. Are you interested in history? What was the motivation for writing that paper? The motivation for writing that chapter was that we were asked to. But the people who asked knew I could talk Lyle into joining me. Regrettably he is no longer with us, he just passed away 3 months ago, but he was one of the people who did his degree in the 1950s. His graduate mentor was purely an experimental psychologist, not quantitative at all, and yet Lyle trained more quantitative psychologists than anyone. In any event, he was certainly interested in history. He knew Thurstone and worked with him.
I was interested in psycholinguistics because I was interested in developmental psychology, in how things develop over time. There are two kinds of people in the world. One kind tries to understand things by figuring out how they got to be the way they are. Another kind tries to understand things by taking things apart and putting them together. This actually leads to some lacks of understanding inside psychology departments. For example, traditions like cognitive and social psychology tend to be of the latter kind, who by using experiments take things apart to see how something works. Developmental psychologists on the other hand look at how things change over time. The people who seek one kind of understanding don’t always consider the other kind of understanding as a proper understanding, because it doesn’t “explain” to them what they want to know. In any way, my turn of mind was developmental and historical, which are kind of the same thing. In a way they are, I guess. So I’m interested in how things in statistics got to be the way they are. It’s also why I was in the developmental psychology program in graduate school. And for understanding the sociology and structure of our discipline, I go back to how the field developed over time. So yes, I’m interested in those things occasionally, and I write about them. When you look back on the history of psychometrics, who do you admire most? Thurstone for sure. Thurstone made everything. Thurstone made the discipline; he came from nowhere, got a degree in engineering and created quantitative psychology, created scaling, changed factor analysis into multiple factor analysis. He founded the Psychometric Society. He trained many of the people who trained the people who trained us. If you think of one of the most influential papers in psychometrics, would that also be his? I’d probably choose his body of work rather than any individual paper; his papers were all on different topics. 
I start chapters with a reference to one of his 1925 papers which is on educational psychology, testing, and many quantitative psychologists would consider that paper frankly obscure.8 It carries the seeds in it of item response theory. Thurstone sort of invented the idea that nonphysical objects could be scaled, could be assigned numbers. If we had a Nobel Prize… That would be nice. We would give him one for that idea.
8 Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.
What would you consider the biggest contribution of psychometrics? The biggest contribution of psychometrics is clearly testing. There is admissions testing for colleges and graduate schools, military assessment testing which blurs into the testing for the recruitment of people, testing that’s used by clinicians or counselors to identify challenges their clients might face. There’s testing in schools. Turning testing into a reliable enterprise for producing useful numbers is a trick. First psychology and then quantitative psychology have done it in almost exactly the last century and it now pervades life in the developed world. For better or for worse. I had a history professor in college who would occasionally try and rock our boats by cynically observing that he thought the idea of progress was a myth. This idea that everything we do is progress and therefore “better,” though it’s not always 100% clear that it is better. Nevertheless, I’m a big fan of indoor plumbing; I think some things represent progress. There’s a video series I’ve used at times in class and the narrator, Philip Zimbardo, the author of one of the best-selling textbooks in psychology, just says “intelligence testing is what put psychology on the map.” Certainly in most psychology departments, especially Balkanized psychology departments, there are countless scientific contributions that other psychologists would consider more important, and those are all significant intellectual contributions, but for the world, testing is what psychology contributes. And is testing also going to be the future of psychometrics or psychology? Interestingly enough, you can get multiple answers to that. I think, for better or for worse, testing will continue to develop and continue to be a thing that is done for placement in education and in jobs.
In this same video I’m referring to, a cognitive psychologist takes the position that testing has outlived its usefulness and more understanding and details of cognitive psychology will render it superfluous. I will not hold my breath for that science fiction. Perhaps. Things change. But I think testing still has some decades, if not centuries in it. And do you have any personal plans for the future? When you get to be my age, one’s intellectual contributions become largely reactive. It’s all one has time for. I jokingly refer to my job now as “read and review.” People send me something, I’m supposed to read it and write comments about it. It honestly doesn’t leave me the kind of time that I had when I was younger and could set out to write a computer program. One stops programming as one ages, not because you can no longer do it but because you’re simply not allowed to. I still work on collaborative projects. For 10 to 15 years, we have been working on outcome measures, and there is some research and writing involved with that.9 I supervise what the graduate students do with the data, coming up with new ideas, which is all part of the group enterprise. There are some committees that consider the validity of the National Assessment of Educational Progress in the United States, which is a national assessment, and a state test. We do research on the validity of these tests. I’m one of the people who either gets asked those questions or becomes part of a group that’s asked those questions, so like I say, my work is now a form of reacting. Back to the path of least resistance. That has certainly been an important theme. Thank you for this interview!
9 Reeve, B.B., Thissen, D., DeWalt, D.A., Huang, I-C., Liu, Y., Magnus, B., Quinn, H., Gross, H.E., Kisala, P.A., Ni, P., Haley, S.M., Mulcahey, M.J., Charlifue, S., Hanks, R., Slavin, M., Jette, A.M., & Tulsky, D.S. (2016). Linkage between the PROMIS pediatric and adult emotional distress measures. Quality of Life Research, 25, 823–833.
Chapter 13
Bill Stout
“I was always fascinated by human behavior. We’re wildly idiosyncratic creatures, with our intelligence, our passions; it’s a fascinating thing.” Bill Stout is emeritus professor of statistics at the University of Illinois at Urbana-Champaign, where he spent his entire career and was president of the Psychometric Society in 2001. He wrote his dissertation in probability theory under Yuan Shih Chow’s supervision at Purdue University and finished in 1967. After becoming professor of mathematics, he switched midcareer to a career in psychometrics. His research interests are latent IRT unidimensionality, test bias, and diagnostic classification modeling.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_13
Thanks, Bill, for participating in our project on the history of psychometrics. Basically, I will be asking questions about three themes; one of them is your career as a psychometrician or maybe a mathematician. I think I’m more of a psychometrician. Mathematics led me in a path to psychometrics. Other topics are the relationship between psychology and psychometrics, or psychometrics and other fields, and we’ll end with your view on the history and future of psychometrics. And I always start with the question: how did you end up in psychometrics? Yes, let’s see, that’s a nice question. The first half of my career was in mathematics, here at Urbana-Champaign where I’ve worked all my life. It’s a lucky thing actually; in America, there are more slots for people in universities, so it’s a little easier to stay in academia. I had just gotten to full professor around the mid-1980s. I was working on these “theorems,” as mathematicians call them, mathematical statements that you prove. Some theorems I proved still stand now. I was doing ok, and I even had some colleagues who I was working with. So, on the outside everything looked wonderful, except the passion was gone. At one point, we had a departmental meeting in the department of mathematics, and one of my colleagues, who was particularly passionate about his work, stood up and said: “I love my theorems, they’re like people, they come alive!” I was sitting at the back and thinking that that was not how I felt about my theorems at all. So I sort of knew that I had to do something, and I liked the idea of applications. If I had to choose between physics and engineering, I would probably choose physics, but then I would want to do engineering. Anyhow, here I was, knowing that I needed to do something with my career, even though I had gotten to full professor. 
I could’ve just stayed in mathematics for the rest of my life, but the university had something called “faculty study in a second discipline,” which was essentially a sabbatical. It was a competitive program for which you could apply, and I got in. Interestingly, mathematicians were pretty good at getting in, because mathematics interfaces with so many other areas. I applied for this program with the intention of studying psychology. I thought psychology was a really interesting field, and so I had this year to deeply study psychology and try to integrate it with my interests. Once I was part of the program, I ended up paying a lot of attention to psychometrics, and there was this psychometrician at the university, Michael Levine. People know him because he was one of these people who was just amazingly brilliant. Michael and I interacted, and I sat in on a course of his on psychometrics, and I basically drifted into psychometrics and never looked back. The Europeans, the Dutch mainly, were actually very supportive of me. After drifting into psychometrics, I figured out that the people in America who are in educational measurement are sometimes strong mathematically, but for a lot of them, even though they’re very good at what they do, mathematics isn’t their biggest strength. On the other hand, the Dutch psychometricians often have a very strong background in theoretical statistics and mathematics. The very first paper I
submitted to Psychometrika1 was when Ivo Molenaar was the editor, and he was tough, and Larry Hubert was the editor of one of my other key early papers,2 and he was very tough too. These two would edit in great detail, change every third word I felt, and appropriately do so given the mathematician’s way I wrote back then. In America, up to the time of my shift in research interests, the research funding in mathematics and statistics was largely aimed at theory, so statisticians would often study fairly obscure, albeit foundationally important, theoretical issues. But then there was this big shift: abundant data came along, computers came along, and all of a sudden, statistics was very central to so much in science, especially applied areas. It had always been central to science, but it became much more so. So just when the research funding, from, for instance, the US National Science Foundation, changed from emphasizing theory to applied concerns, I had made the shift from working on the law of the iterated logarithm for martingales,3 which was very theoretical, to doing psychometrics and educational measurement with an eye toward applications. I’ve actually received funding from the US government or from testing companies all my life; even after I retired, it still continues. I’m PI on a US Institute of Education Sciences grant right now. With mathematics, I’ve always felt this angst because of its disconnect from applications. As an undergraduate student and when I began graduate school, I often thought I wanted to be a psychologist. In a sense I have done that, but if you’d ask me now if would I want to do my studies all over again and become a research psychologist, I would say no. I would be quite happy if my work would be applied in psychology, but I especially like the idea of studying educational testing: how you mathematically model it and how you do good measurement of it. 
For example, a student takes a 20-item quiz for half an hour, and the teacher wants to diagnose that student’s skills and misconceptions. The test is multiple-choice, so the teacher has the correct answers, and 15 out of 20 correct, say, is a pass. But there’s a lot of information in the particular pattern of her correct and incorrect responses. That’s what I’m doing now. I love the challenge of trying to tease out all of the educationally useful information that is in that test. Another aspect of testing that has fascinated me is test bias. You can ignore test bias and think you’re measuring one thing, the dominant dimension the test is designed to measure, while you’re actually measuring something else, such as soccer knowledge in a math word problem, and thereby you’re mistreating certain population groups. I’m drifting a little bit, but that’s how I got into psychometrics: dimensionality, test bias, and a little later diagnostic classification.
1 Stout, W. F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617.
2 See Shealy and Stout Psychometrika reference in Footnote 8.
3 Stout, W. F. (1970). A martingale analogue of Kolmogorov’s law of the iterated logarithm. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 16, 279–290. Stout, W. F. (1970). The Hartman-Wintner law of the iterated logarithm for martingales, Ann. Math. Stat. 41, 2158–2160.
When I made the switch from pure mathematics to psychometrics, there was no separate department of statistics at the University of Illinois. But we created a department of statistics that I very quickly became a member of. I’m no longer a member of the department of mathematics; I’m now in the department of statistics. When you chose that second discipline year, what about psychology appealed to you in the first place? I was always fascinated by human behavior. We’re wildly idiosyncratic creatures, with our intelligence, our passions; it’s a fascinating thing. Do I have a deep understanding of psychology? I would say no certainly. I have an undergraduate’s understanding of certain fields in psychology, but I guess I’m currently doing something that fits into cognitive science, which is certainly a field of psychology, and in particular the diagnosis of knowledge, as well as the lack of it. Do you consider what you’re doing with educational testing also psychology? Yes. I now officially work (remotely) at the Learning Sciences Research Institute at the University of Illinois in Chicago, which involves a lot of cognitive science research. Jim Pellegrino, a very well-known professor of educational assessment, codirects LSRI, and several curriculum specialists work there. So yes, I’m definitely doing a kind of applied cognitive science. The psychometric field I’m currently working in has various names, like cognitive diagnostic modeling or diagnostic classification modeling. If someone asks me what I do, I would answer: “I apply statistics to educational testing.” I’m interested in formative assessment, that is, assessing students not at the end of their program of study, but while they’re still learning, to try to assess what they’re learning and not learning, so as to help the teacher. I apply psychometrics to that challenge. And that has become a big deal research wise, and I guess I’ve had a little something to do with that. What exactly has become a big deal? 
Applying psychometrics to doing diagnostic classification. Let me just show you something. There’s a handbook coming out, published by Springer, called the Handbook of Diagnostic Classification Models.4 I helped write one of the chapters in that book, and there’s another article therein on some research that I’ve been involved in, and a couple of my former students have also written chapters. I’m making diagnostic classification sound a lot about me when so very many others are contributing to the field in very important ways, and diagnostic classification has become a big deal. Diagnostic classification tries to drill down and diagnose in some useful detail what students are learning and not learning. Psychometrics did IRT forever, and the latent space was just considered a unidimensional continuum, and people were interested in a person’s “theta.” The diagnostic classification latent space is multidimensional. Consider an algebra unit, which tries to teach students 12 skills say, and there are 8 important misconceptions that are learning pitfalls. While the student is progressing through the course, you want to know what the student is learning and what the student is struggling with. That’s formative assessment diagnostic classification. Is diagnostic classification now being applied in schools? That is the idea, let’s see. I am not an expert in knowing how widely applied it is. If you talk to people in testing companies, they’ll all say that’s the goal, and they’re selling it. I’m not sure about the number of places where it is applied. We have a model that we think is going to be very effective in teasing out the maximum amount of information from the test, so we think we can help with that. I recently refereed a paper from a couple of scholars in Iran, and they’ve applied our RUM model to assess learning a second language.5 They have identified certain skills that exist in acquiring a second language, according to cognitive scientists and language specialists in Iran, and then they’ve developed an associated assessment test. They used RUM, which we developed several years ago, for assessing the resulting test data. I don’t think diagnostic classification modeling is being widely used yet, but there are many of us who do this kind of research. We all hope for more applications. By the way, it’s not an effort to get rid of the teacher. It’s not like a student can just go online to learn without having to deal with a teacher at all. Diagnostic classification is more about helping the teacher, so that he or she knows better what the students are learning and not learning. So the diagnostic classification is one of your main research pillars. It’s a developing pillar. But yes, for sure; it’s become my main interest. What are some of the other fields that you worked on? In mathematics, I worked on probability.
4 Von Davier, M., & Lee, Y.-S. (Eds.). (2019). Handbook of Diagnostic Classification Models. New York: Springer.
Probability is kind of the backbone of statistics, of course, or rather one of its backbones. The first thing I worked on psychometrically was unidimensional IRT; I worked on a theoretical asymptotic underpinning and a method to decide whether the test data are really unidimensional or whether there are other major dimensions coming into play. I developed a procedure to assess “essential” unidimensionality,6 and Paul Holland was actually interested in that too.7 In fact, he did something from one viewpoint, and I did something from another viewpoint, and when we actually got together, it turned out we were kind of looking at the same thing. I was very interested in when we can say that the latent space is close enough to unidimensionality to be called unidimensional. Nothing is ever going to be perfectly unidimensional, so I developed something called essential unidimensionality. I’m happy with that work, and it has evolved into some things. I was also very interested in test bias. Together with Robin Shealy, a graduate student of mine, we developed a theory to understand and to assess test bias.8 What did you do in the test bias field? We did two kinds of things. One was to develop a specific procedure, which we called SIBTEST, Simultaneous Item Bias Test. We emphasized that you shouldn’t only be looking at individual item bias, but at how items can produce bias as a group. Say a teacher is teaching fifth grade American students, and the teacher says: “I want to make this interesting for my kids, I’m going to make all of the items about professional American football.” Maybe less so today than 30 years ago, but this would tend to be biased against girls. If several questions have a context of sports or football in particular, our idea was that for a test to be seriously biased, it wouldn’t just be one item that is biased; it would be several items producing bias collectively. We took the viewpoint that what was really going on is that there’s another “nuisance” dimension being assessed. You think you’re measuring mathematics, but you’re really measuring mathematics plus this other dimension coming in from sports knowledge differentially mastered. We did some theoretical work to stress it was a multidimensionality phenomenon, as referenced previously. Has DIF now found its application? In some places, yes. All of the testing companies in America, probably in the Netherlands too, do what they call a DIF (differential item functioning) analysis. Is our procedure heavily used? I don’t think so; I think it’s being used in some places like Measured Progress.
5 Ranjbaran, F., & Alavi, M. (2017). Developing a reading comprehension test for cognitive diagnostic assessment: A RUM analysis. Studies in Educational Evaluation, 55, 167–179.
6 Stout, W. F., Nandakumar, R., Junker, B., Chang, H. H., & Steidinger, D. (1991). DIMTEST and TESTSIM [Computer program]. Champaign: University of Illinois, Department of Statistics.
7 Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent trait models. Annals of Statistics, 14, 1523–1543.
The Mantel-Haenszel procedure,9 which is what Paul Holland worked on, is heavily used and is very effective. I think SIBTEST should be used more, and I’m glad there are still papers coming out about it. One of the things I’m really pleased about, which also helped greatly to advance the research that I was interested in, is that I’ve had so many good doctoral students, some exceptionally good students and indeed better than me. You interviewed at least two of them: Brian Junker and Hua-Hua Chang.
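For readers unfamiliar with the Mantel-Haenszel procedure mentioned here, the following is a minimal sketch of its core statistic for a single item. It assumes examinees have already been grouped into strata by a matching score, and that each stratum supplies four counts: reference-group correct and incorrect, then focal-group correct and incorrect. The function names and the input layout are hypothetical, and the sketch omits the accompanying chi-square test and any continuity correction. It is not SIBTEST, which works differently.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched-score strata for one item.

    Each stratum is a tuple (ref_correct, ref_incorrect,
    focal_correct, focal_incorrect). The tuple layout is an assumption
    made for this sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum carries no information
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio undefined: denominator is zero")
    return numerator / denominator


def mh_d_dif(strata):
    """Rescale the common odds ratio to the delta metric (-2.35 * ln alpha).

    Values near 0 suggest little DIF; negative values mean the item is
    relatively harder for the focal group at matched score levels.
    """
    return -2.35 * math.log(mantel_haenszel_odds_ratio(strata))
```

As a usage note, an item with strata `[(40, 10, 20, 5), (30, 30, 10, 10)]` has the same odds of success in both groups within each stratum, so the common odds ratio is 1 and the rescaled index is 0. Matching on total score first is what separates item-level DIF from an overall difference in group ability.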
8 Shealy, R., & Stout, W. F. (1993). An item response theory model for test bias. In P. W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 197–239). Hillsdale, NJ: Erlbaum. Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194.
9 Holland, P. W., & Thayer, D. T. (1986). Differential Item Performance and the Mantel-Haenszel Procedure. Paper presented at the 67th Annual Meeting of the American Educational Research Association, San Francisco, April 16–20, 1986 (one of many sources on the Mantel-Haenszel).
Do you think you had a different view on things than other psychometricians, due to your mathematical background? Yes. When I did the work on unidimensionality, I was interested in proving theorems about what happens asymptotically. What happens when the test length gets longer? What happens when the number of students taking the test gets larger? Large sample theory is what they call it in statistics. My mathematical background made me think that, whatever I did, it needs to have a certain theoretical rigor. When I put it that way, it almost sounds pompous—psychometricians tend to be rigorous too—but I was especially sensitive to the idea that there needs to be a kind of mathematical rigor. Also, my strength was my probability background. In fact, if someone asks me about my greatest strength, it’s certainly not developing statistical procedures, although I’ve been involved in that. I think that my greatest strength is developing probability models and especially statistical models that describe what’s really going on in the data. Of course, models are imperfect descriptions; if they were perfect, they wouldn’t be models. The idea of developing models drives me. I also like the idea of developing statistical procedures, but I’m just not a sophisticated statistician. I mean, I came from mathematics, then to psychometrics, and finally to applications in educational testing, so I know something about statistics, but I’m not on the frontiers of statistics; I’ve never been, and I never will be. You prefer to do work that has some kind of application. Absolutely, I’d like to work on something that at least has the potential for application. Just to go back, as I stated previously, I proved something in mathematics called the law of the iterated logarithm for martingales. A martingale is a model of fair game, which is of interest to a lot of mathematicians. 
This law of iterated logarithm was best possible, which means that there is no way to improve on the result because it’s just the best you can do. This result is an asymptotic result, like the law of large numbers. The law of large numbers comes into play because if you toss a coin 50 times, the proportion of heads you observe is pretty close to a half if it’s a fair coin. The things that the law of iterated logarithm tells you about the real world would apply if you had like a billion observations, say. So, it’s not practical. I’m not interested in working on something any more unless it seems to have the potential at least of being practical. It’s still my dream that the diagnostic assessment modeling we have been working on over the years actually will make a difference in the classroom: much nicer rather than having the work remembered with “this is beautiful stuff,” which is something a mathematician might say. Do you collaborate with psychologists in your work? I interact some with cognitive scientists, like Jim Pellegrino; his focus is psychological and educational assessment. I do not work actively with psychologists; I probably work more with educational researchers. One of my former students, Jeff Douglas, now in my department of statistics, does psychometrics. Another student, Hua-Hua Chang, does half psychology, half education (I’d say), and a third student,
Louis Roussos, is 100% in educational assessment (I’d say). I mean, in America you find that the people who are interested in psychometrics and measurement are mostly in education and psychology departments, but a few of them are in statistics departments. I think many people would consider psychometrics as some kind of aid for the psychologist, to help them in the analysis. Their own slide rule! Yes, I think that’s true. Do you think that psychometricians also contribute to psychological theory? I claim no personal role in that, but absolutely! In psychology, you can study the individual person and go deep; you don’t always have to be doing work that has a statistical flavor. I guess psychologists debate about this, and there are fancy names for whether you’re interested in populations or interested in just studying individuals very deeply. But if you’re going to study groups of people, or if you’re going to study mental aspects that are shared across people, you have to look at data across people (lots of “subjects”). And as soon as you look at such data, you need some sophistication in statistics. Maybe computer science and visualization help too, but you need some expertise in dealing statistically with data. Now, there are plenty of people in psychology departments who have real expertise in applied statistics, but this is where psychometricians are necessary because of their specialized knowledge. I remember in our University of Illinois psychology department, there was a conversation between one of the psychometricians and a psychologist who was in community psychology, and they were on a committee together. I knew both of these people: one was the stereotypical psychometrician, kind of reserved and very precise, and the community psychologist was this very verbal American type. The psychologist said: “I’ll supply the words, and you supply the numbers.” The psychometrician just looked at him like he was slightly crazy. 
What I mean is, when there’s data, you need to be able to analyze it, and that requires both some sophistication in the math, the model, and the methods you use and also some substantive sophistication in the very modeling of it, how you describe what produced that data. Sometimes the qualitative and the quantitative types do not mix well together, even though both are needed. Do you think psychologists model enough? I have no idea. I don’t know enough about what psychologists do. I don’t read psychological journals. You could also ask whether psychologists use good enough techniques to analyze their data. There are some people running around and saying “you’ve analyzed this data all wrong.” I don’t know the answer to that question either.
The reason I ask this is because I think people have a lot of opinions about how psychologists do research, and as you said, when there’s data a psychometrician might be necessary, or at least, the person doing the analysis should know enough about it to do it well. In an ideal world, if you’ve collected data from a bunch of subjects, and you’ve measured your variables, and also when that data is multidimensional, there are all kinds of statistical questions that are relevant. For example, how did you do your sampling? And of course, the researcher has a theoretical perspective: a psychologist thinks that A, B, and C might be going on here, so the question is how to link this theory with the data. Modeling is kind of that intermediate connection between data and theory. You might have interviewed Bob Mislevy; he’s very into how and what you can learn from data. I interviewed him a couple of days ago. Alright. Good guy! Yes! I was wondering; do you mainly focus on psychometrics, or do you also incorporate techniques from other fields? The main thing that mathematics has brought to me was the ability to reason mathematically when that’s important. In this diagnostic classification grant that I have now, there are three of us. There is Bob Henson who brought his educational measurement perspective, and he will just say he’s not a mathematician. He’ll sort of joke how “you guys supply the mathematics.” The other colleague, at the University of Illinois in Chicago, was mathematician Lou DiBello—and I say was because he unfortunately passed away recently—who was a mathematician, an algebraist, and my field was probability. The theoretical stuff we’re doing now has some algebraic aspects to it, and when Lou was no longer able to do the work, I had to dive in and do some algebraic stuff. Now, this was not advanced algebra, not the kind a Ph.D. student would do, but in its own way it was kind of deep, and I had to really struggle to do the algebra. 
There were some deep questions about modeling; we were, for instance, worried about identifiability (different parameterizations producing the same model) in some sense. I struggled, and it was a slow process, and it was sometimes painful, but I did get there, and I think my mathematical background helped, even though my research never had an algebraic aspect to it. Is that kind of what you’re getting at? Well, I think some psychometricians might say that they’re heavily influenced by data mining techniques, for example, but you’re not one of those, right? No. In fact, I don’t even program. Bob Henson, who is at the University of North Carolina at Greensboro, and the graduate students on the grant do the programming. We analyze data using something called Markov Chain Monte Carlo, which is a very sophisticated way of looking at the data and trying to figure out the parametrically complex model that’s producing it, estimating parameters, and so on. I couldn’t
do it; I’m not a programmer. Data mining, as I see it, and this is probably an amateur’s view of it, is looking for patterns in the data. The probabilistic perspective is to go at the data with a model in mind. You develop a model, which these days can be very complex with a lot of parameters, and then you hope that that model describes the data well (“fits the data”), and so you worry about lack of fit. But the whole data mining approach, that’s not what I do, though it’s important. You want to actually explain the data; you want to know the driving force behind the data. We have this latent class model, a multidimensional latent class model with a lot of unknown parameters which describes how we think examinees respond to items. And we have this latent space of examinees, with maybe seven skills and four misconceptions, say. Suppose the test is 40 items long, and we’re scoring the test not right or wrong, but we’re also looking at which incorrect response has been given by the student, i.e., “nominal scoring” each item. Ideally, in the future, though it’s maybe beginning to happen like with the Iranian study, the test will be designed so that you are careful not only about the stem of the item and the correct response but also about the incorrect responses. The incorrect responses can have a lot of very useful information, so that requires a sophisticated model. In the referenced handbook, there are lots of approaches discussed in various chapters by various researchers. I think we have a good approach (the Generalized Diagnostic Classification Model (GDCM)10), obviously, but I think there are other good ones too. As a matter of fact, in the beginning of this handbook, the first three chapters deal with three very different approaches: One is a Dutch approach stressing Mokken scaling, something that Dutch psychometricians have heavily developed. Another one is the Almond and Mislevy Bayesian Networks approach, and Bob must have talked to you about that. 
One of them is ours, GDCM: RUM. Then, there is Kikumi Tatsuoka’s rule space method11—and many others! This field is strong and robust enough now. It’s not like there are just one or two approaches; there are many of them. There are 31 chapters in this book about all different topics people are working on. Twenty years ago, it wasn’t like that, but now the field is exploding. You mentioned the Dutch approach. I’ve never really considered the Dutch approach to be different from other approaches, but do you consider Dutch psychometrics to be different from American psychometrics? No, I wouldn’t want to view it that way. I have an affinity for Dutch psychometricians. I also think of Jim Ramsay in this category, even though he’s of course not Dutch. These are people who stress rigor and are very theory driven. They do things like Mokken scaling. The American approach, coming out of education
10 DiBello, L. V., Henson, R., & Stout, W. (2015). A family of generalized diagnostic classification models for multiple choice option-based scoring. Applied Psychological Measurement, 39, 62–79.
11 Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based on item response theory. Journal of Educational Measurement, 20, 345–354.
departments, is often a little less formal and less statistics theory driven. It has real value; I’m not trying to create a hierarchy here, but the Dutch are kind of more rigorous. Do you know what your most cited paper is? I have no idea what my most cited paper is. Let me reframe your question. “Of the work you and others have done in psychometrics, which work has had the most lasting impact?” Is that ok? Sure! Let’s see. My work on unidimensionality was pretty influential. I don’t know where it will go, but it hasn’t died. Others have also done some pretty good work in that area, but I’d probably rank that number 3. The DIF stuff may end up being somewhere important, especially the modeling of test bias as a multidimensional phenomenon. I’m 79, and I’m fortunately blessed with good health so far, but those things could change. I went online, and I did one of these things where you describe the various components of your health and lifestyle, and then a projection is made: the silly thing said I would reach 97, which I can’t imagine! But actually, I’m assuming I will be able to be active in research for another 5 or 10 years at least—I simply enjoy it, and my university supports it. So anyhow, what’s the point? The point is that I think we may have kind of nailed it with our modeling approach to diagnostic assessment. There are other good approaches there, Mislevy’s work is very good in its own way, and it has had a major impact. When you ask him the question about impact, I’m sure there are thousands of citations. Just thousands. But I think my diagnostic classification work has the most potential, because of how we model multiple-choice items. There is the cognitive aspect: each of the response options of an item has a certain appeal to the examinee because they go through a certain cognitive process, but the options compete with each other, and sometimes the examinee throws up his hands and guesses. 
We’re trying to untangle guessing and option competition and what’s going on with each option in a cognitive way. I think what we have there is pretty good. It’s simpler than what Bob does, and that it is simpler in some settings may actually be beneficial. And I like the European/Dutch stuff, the Mokken scaling. By the way, my work on unidimensionality was nonparametric, so there was some affinity between the Europeans and me because of their nonparametric work. Who has really inspired you? It can be someone who lived a long time ago or someone you admire now. Wow, let’s see. I’m going to change your question just slightly. I’m kind of inspired by a lot of psychometricians in a way. They’re nice people, serious about their work, concerned about applications. I like the field; I work with nice people; I don’t like to work with “un-nice” people. But I think Jim Ramsay is remarkable. Jim and I kind of became friends back in the day although we’ve lost contact with each other.
Anyhow, I admire Jim a lot. I like Jim’s ability to think outside the box, to be very theoretical, and yet his work could be very applied. Wim van der Linden is also very impressive. He definitely functions as a statistician; his interests are very broad, and his work is very applied. My major professor was very inspirational. Yes, we haven’t really talked about him! Yes, Yuan Shih Chow. By the way, this is an interesting thing about America: at one point, I had as many Chinese Ph.D. students as American students, and there wasn’t anybody from anywhere else. But Y.S. Chow was a wonderful advisor. He was a mathematician, a probabilist. He had his standards. He was very supportive. I don’t know how to explain it, but when you’re a Ph.D. student, you’re kind of vulnerable in a way. You have all these uncertainties, and he was always very supportive of the work I did. The direction he pushed me in led to the work I did after my Ph.D., which got me to full professor, which then enabled me to go smoothly over into psychometrics, so in some sense, I owe it all to him. He had some very strong ideas about the way probability should be taught to graduate students. I was his first student, and in the end, he said, “I must’ve done something right because Stout did some nice stuff in measurement.” He had a strong influence on you. He definitely had a strong influence on me. I was also very inspired by Walter Philipp, one of my former mathematics colleagues. He is now deceased. We had a wonderful working relationship, and we were very close friends. That was nice. Even though I was drifting away in terms of my own appeal for probability, Walter and I worked together and spent a whole year on one project: a memoir of the American Mathematical Society.12 My friendship and my personal interaction with Walter were enormously influential. At the same time when I left the field, he stayed in the field. 
He and I came together in what was a new thing in probability at the time, but I left, and he went on and became a giant in the area. Walter was a good guy. This is on the personal side. There’s symmetry to things. My previous wife and I introduced Walter to his wife. Walter is now deceased, and my current wife worked with Walter’s wife, and after I lost my wife, Walter’s wife introduced us, so, payback or something, Walter was a very good friend. We worked together a lot, and that was inspirational. Ivo Molenaar was nice to me, and I enjoyed interacting with Ivo. We have different personalities; he is much more reserved than I am, but a very decent man. Wim van der Linden supported my becoming president of the Psychometric Society. By the way, you can take this out if you wish, but I’m just tossing this in: I was one of the worst Psychometric Society presidents ever! I’ll tell you why. When I was president, my wife contracted cancer, so I spent that whole year living at the hospital. I
12 Philipp, W., & Stout, W. F. (1975). Almost sure invariance principles for partial sums of weakly dependent random variables. Providence, RI: American Mathematical Society.
stayed there day and night. Not the best of times. I think if you look back at the number of people attending the Psychometric Society Meeting that year, it was one of the lowest numbers ever! I’m sure they’ve forgiven you. I think so, that’s why I tell you the story. I don’t feel guilty about it; you do what you need to do. When you look at psychometrics as a whole, if that’s at all possible, what do you think is its biggest achievement? What really appealed to me about psychometrics was factor analysis. One of the things I like is this original idea of standardized testing, to make things fair. Let’s have an objective measure of what the students know and are capable of, so it doesn’t matter whether the student is x or y or z; what they know and do is important. Factor analysis and IRT certainly are part of the effort to look at these tests and try to make them function as they’re supposed to. And what’s also appealing is to discover things about these methods that weren’t originally realized, like doing diagnostic assessment, and then to develop the theory to deal with it. The little reading I originally did about the history of psychometrics was on this whole notion of how many dimensions of intelligence there are; it’s kind of fascinating to read the literature, it has led to some wonderful things. Do you also think it has led to some... Bad stuff? Done right, it should be okay. The Bell Curve book and the idea that certain races are more intelligent than other races—I wouldn’t blame that on the psychometricians; this book oversimplified a very complicated subject—nature, nurture, and so on. Historically speaking, there were some moments in time that were not very positive episodes. Yes, but then, I know so little that I don’t even want to comment on that. But yes, that’s true. But look at these physicists who have developed the atomic bomb; you will also have to ask them these questions. 
I don’t consider it as blame; it’s just that in a lot of fields, there were some developments that happened, which in retrospect seem less positive. Certainly, there are uncomfortable truths out there, ones we must not ever forget. Some, totally abstract, science comes along, using statistics or psychometrics, and very rigorously and carefully establishes an uncomfortable truth. That’s not misapplied, that truth, but if it’s a truth, we need to know it. Like the problem of climate change. If we have global warming, which I’m 99.999999% sure we do, and humans are a major factor in it, we better know it.
I always end with the question: What are your own plans for the future? Are you going to continue the diagnostic assessment? I know so, yes! I’ve become very narrow, in what I do, and I don’t even try to keep up in a lot of areas except for diagnostic assessment. For example, in New Zealand, they have a national standardized test, driven by some good psychometrics. I’ve learned a little bit about that test; I’m imagining what could happen in the future to that test with the insertion of some diagnostic modeling. I would like to continue to work in diagnostic assessment. I want to partner with people who are in a position to really apply it, and I would like to see it widely used in some settings. I want to solve the theoretical problems and work with people who can do the programming, and maybe work with people who have access to school districts, or even national testing programs. Now this, this is the dream of course. You should dream! Yes! What’s the probability that that all happens? Large enough that I’m willing to keep working on it. And if it doesn’t happen, will I be deeply disappointed? No, definitely not. I work full time now, and I proceed at a certain pace. I’m different than Wim van der Linden, who can do a whole bunch of things at the same time, or Hua-Hua Chang who is travelling all over the world spreading educational assessment. I have a limited capacity to do gobs of broadly focused work, but I want to stay in this narrow area because I think it’s very important and I hope to see it applied. And I’m going to try to find ways to interact with people who can help it be applied. The whole New Zealand thing, it’s a daydream of mine to involve diagnostic modeling, but it’s a daydream that has a possibility greater than 0, so I’m willing to entertain it. Who knows? I said it was the last question, but I’m actually going to ask one more, which I should’ve asked earlier. You lied! I did, but this is the final question, I promise! 
Is there a hurdle psychometrics still has to take? Is there something we should overcome in the future? There was a feeling 10 or 15 years ago—I remember talking to some people about it—that the stuff in Psychometrika had become too narrow and too technical and didn’t have the breadth of application. I imagine when you talk to people like Mislevy and others, who really have a sense of the field, they may say things like “psychometrics needs to learn how to take its very good theory and synch it with the applied problems that are out there.” I don’t know how big an issue that is, but to me, as a psychometrician, I want to know that my work will be applied, whether it’s in education or psychology. Interestingly enough, we got our start on our diagnostics work, up at the University of Illinois in Chicago, through the medical school. A biostatistician, Robert Gibbons, who considers himself more on the psychology side
than the biological side of things, thought it was a good idea because this diagnostic stuff could have applications for medicine and could help us get funding. Some people have medically focused work done in psychometrics, on dimensions of mental illness or certain kinds of addictions. I know very little of that, and I don’t even have a whole lot of interest in that—I’m focused on education and focused on testing and formative assessment—but if I were to develop some new approach to factor analysis, I would care far less about it being “really beautiful” than that somebody would say, “now we can go out and do some things we weren’t able to do before.”
Chapter 14
Jacqueline Meulman
“We shouldn’t give up our carefully built-up practices in statistics to careless massive computation just because we are able to do so.”
Jacqueline Meulman, emeritus professor of applied statistics at the Mathematical Institute at Leiden University and emeritus professor of statistics (Adjunct Faculty) at the Department of Statistics at Stanford University, was president of the Psychometric Society in 2002. Meulman earned her Ph.D. in 1986 under the supervision of Jan de Leeuw and John van de Geer at Leiden University. Her research interests are nonlinear multidimensional data analysis, clustering, statistical learning, and data science methods. Jacqueline Meulman received the 2020 Psychometric Society Career Award for Lifetime Achievement.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_14
I always start with the question: how did you end up in psychometrics? That’s actually an interesting story. I never really got involved in psychometrics in the narrow sense. Like many psychometricians, I started by studying psychology (in the early 1970s), because the curriculum offered a lot of freedom with interesting topics to choose from, but I had no idea of what sort of specialization I wanted to be in at the end. I chose psychology because I didn’t want to study anything like mathematics or physics or chemistry, because in those days, they were considered very conservative studies. Psychology, on the other hand, was very popular, and so was sociology. However, I thought sociology was way too vague, and I didn’t really want to do that. Psychology, on the other hand, was very broad; there were lots of things I could study with a psychology degree, although it was never my intention to become a psychologist per se. After a couple of years, I became interested in the history of psychology, especially in the history of psychometrics, which was, I think, my first step in psychometrics. I was given a teaching assistant position, and I had to write a chapter as study material on the history of psychometrics, so I delved into that. Excellent topic. I fully agree! On the one hand, I was amazed how many bad things had happened in the early days of psychometrics, especially before the First World War. I was flabbergasted. The bad image of psychometrics among outsiders persisted for a long time; a very good American colleague and friend of mine once told me that she was sometimes asked: “What is a nice Jewish girl like you doing in a Society like that?” On the other hand, I was intrigued by the mathematical background of the methods I was reading about, like factor analysis—one of the major accomplishments in the history of psychometrics. So I was very intrigued by the methods themselves, which urged me to learn more in that area. 
Then I learned about multidimensional scaling being another class of techniques that had a history in psychometrics since the early 1960s and multivariate analysis with optimal scaling being another one. John van de Geer had started his own department, the Data Theory Department—the most wonderful place for research—and I became a research assistant there in 1978. The Department was never just dedicated to psychometrics; it had a much wider view. I attended all sorts of courses during my Master’s, which built up my background in statistics and mathematics. Of course, if I wanted to graduate, there had to be a general theme to this list of courses, and I wound up with a degree in mathematical psychology and data theory. That’s why I never really considered myself a psychometrician in the narrow sense; in the rest of the Netherlands, psychometrics was very closely identified with IRT and test theory, and I never really studied those topics in any depth.
Mathematical psychology and psychometrics used to be one category and then at one point they diverged… Yes, that’s true, but in my case “mathematical psychology” was just a name we chose for my degree. I think the only place where you could officially study mathematical psychology in the Netherlands in the late 1970s was Nijmegen. They had an official mathematical psychology department. I did a “combined course program,” which needed a name that had something to do with psychology, because that’s where I had officially started. John van de Geer’s Chair was called “Mathematical Psychology and Data Theory,” so that’s why we chose it. We are now in the Mathematical Institute at Leiden University, but you said that initially you didn’t want to do mathematics because of its conservatism. That was in the days of the student movement in the late 1960s and in the early 1970s. Things have very much changed since then. When you started studying psychology, did you prefer the more mathematical or statistical side of things? After a few years yes, but not necessarily in the beginning. I always liked the statistics courses in the curriculum, but after a couple of years, after writing about the history of psychometrics, I started to attend courses on linear algebra, multivariate analysis, and multidimensional scaling. I became very enthusiastic about those topics, so I figured that that’s what I wanted to do. And continued to do so. What about it was so interesting to you? I think the more exact side of it. I loved programming; it was one of the things I learned very early on (since 1978), and I really enjoyed it. When I was a master student, I was already programming in different languages and on different platforms. The Data Theory Department really encouraged that; it had already started writing its own user-friendly programs by then. Those programs were sent out, all over the world, on big tapes, with accompanying users’ guides. It was truly wonderful. 
When I received my degree, I had the opportunity to go to the USA, to work at AT&T Bell Labs in Murray Hill, NJ, for a year as a consultant, where I did more programming. Looking back on my career, that was the most amazing year. Bell Labs was one of the most famous research institutions in the whole wide world, and there were so many incredible statisticians there. They were my heroes from papers I had read, and suddenly they were around, and you could just talk to them! The world was much smaller in those days. Nowadays, people travel and go to conferences all over the world, but in those days, not many people went abroad for such a long time. This was in 1982, so a long time ago. It was such an incredible period. I worked with Doug Carroll; we were working on a program called PREFMAP-3, and Doug took me to conferences. That’s how I wound up in the Psychometric Society actually. There was a conference of the Psychometric Society in Montreal
that year, when Jim Ramsay was President. I gave my first talk there. Suddenly, you were in the middle of an all-American group; there were not many people from the rest of the world in those days. I met many people, and I was very proud to become a member of the Society. I really felt at home, because the topics that were important in the Psychometric Society then were much broader than they turned out to be many years later. As you said, you are not an IRT person or a Structural Equation Modeling person, which are the main topics right now. No, but I met Peter Bentler and Karl Jöreskog, several of the originators of SEM, which was quite something. The next step was going back to Leiden where I obtained a Ph.D. position, which was called a “doctoral candidacy” in those days. While I was working on my thesis, I became homesick for the USA. After that first year in the USA, I thought I had seen enough, and I wanted to go back to the Netherlands and Leiden; I felt isolated in New Jersey and suburbia. But when I was back in Leiden, I really missed New York, and I wanted to go back. I even missed the suburbs! I met Doug Carroll at a conference in the Summer of 1983, again a Psychometric Society conference—this time in Paris, one of the first international meetings—and Doug asked whether I would like to come back for a couple of months. So I did. I loved everything about being there, even being in suburbia. I had my little rental car, and I drove to New York on the weekends. It was so exciting. Working at Bell Labs shaped the rest of my career. Meeting all these people gave me a head start, and they involved me in many things afterwards. Since then, I have always been active in international societies. There was always the competition—though maybe “competition” is not a good term—between two societies that were important for me: the Psychometric Society and the Classification Society of North America. 
The Psychometric Society only met in North America, and at some point, every other year, there was a meeting in Europe. In 1985, the International Federation of Classification Societies was founded, and influential American psychometricians like Doug Carroll, Joe Kruskal, and Phipps Arabie were involved in that effort as well. I considered those two groups similar, more or less, although the Psychometric Society was still very US-oriented in those days. Did the Psychometric Society focus on different topics than the Classification Society? Although the Psychometric Society had always been involved with test theory and IRT, both societies were concerned with areas like multivariate analysis, multidimensional scaling and clustering. Later on, those topics became less important in the Psychometric Society. This is also quite visible if you look at Psychometrika over the years.
What happened next? I got my Ph.D. in 1986, being in the Data Theory Department. Since then, I’ve always been very lucky to receive major grants. First, I obtained a fellowship for 5 years from the Royal Netherlands Academy of Arts and Sciences (KNAW), which was prestigious and thus a huge honor. I became an Associate Professor in 1992, and in 1994, I received a Pioneer Award, from the Netherlands Organization of Scientific Research (NWO). That was very unique, because it was only awarded to one person in the whole field of social and behavioral sciences and only every other year. The volume of the grant was like a Vici1 these days and it was only given to one person. That’s special. That was very special, and the grant gave me the opportunity to have my own research group for 5 years, so I started to appoint Ph.D. candidates, a postdoc, and an assistant professor. What was your own specialization? From the beginning, I worked on multivariate analysis with optimal scaling; that was my major topic. For my thesis, I combined optimal scaling with distance-based models.2 It was the perfect integration of multidimensional scaling and multivariate analysis. It was both distance-based and oriented towards categorical, ordinal, and nominal multivariate data, nonlinear relationships between variables, and then combined with optimal low dimensional representation. This research was largely influenced by my experiences at Bell Labs and was already carried back to Leiden by other people that had also been there, like Jan de Leeuw and Willem Heiser. At Bell Labs, I was lucky to work with Paul Tukey, who was a distant nephew of John Tukey, and Paul explained to me how I could program in S. John Chambers, one of the creators of S, was also working there. The very popular statistical programming language R actually originates from S. The S language was very much oriented towards visualization, having lots of functions to make plots and graphs, just like R nowadays. 
So my focus has always been on developing the so-called nonlinear methods, including the associated programming. It had hardly anything to do with psychometrics in the narrow sense. Actually, Paul Tukey once asked me: “why do you say you are a psychometrician? You should say you’re an applied statistician.” Come to think of it, I think he said “statistician,” not “applied statistician,” since statistics at least at Bell Labs was much more applied than academic statistics in the Netherlands in those days.
1 A Vici grant is a Dutch grant for senior researchers, which holds 1.5 million Euros.
2 Meulman, J. (1986). A distance approach to nonlinear multivariate analysis. Leiden: DSWO Press.
I also read that you did work in biostatistics, for example. Is that a more recent focus of your research? No, I’ve had an interest in biostatistics for many years. I never really felt attracted to topics from psychology. Why is that? I really don’t know; I’ve just never been very captivated by them. I was much more interested in new developments in biology and medicine, for example, in cancer research. I’ve been working with researchers involved in cervical cancer research as early as 1990, and I could apply the Leiden multivariate methods to very interesting data. So, in terms of topics, you’d rather work with biomedical researchers than with psychologists? I never did collaborate much with psychologists, which may sound strange because my original background is in psychology. I collaborated not only with biomedical researchers but also with researchers in analytical chemistry. I’m just wondering what the hesitation is. Is it because the type of data is different than biological data? Many things happen just by coincidence. In 1999, I met some researchers who were working in the field of proteomics, metabolomics, and systems biology. I was developing new methods for the analysis of high-dimensional data, and those methods could be applied to those areas, and that’s basically how I became involved there. So, on the one hand, my own interests, abilities, and skills were more oriented to that area; and on the other hand, it was also a matter of the people you meet and whether there’s a scientific click. When the chemistry clicks, then you like to work together. You did not consciously decide that you were not going to do psychometrics. Psychometrics in the narrow sense was never an option. During the period of my Pioneer Award, I became Professor of Applied Data Theory (in 1998), so there was no close link there with psychology either. After about 8 years, I was offered a one-day appointment in the Mathematical Institute in Leiden. 
That was quite groundbreaking in those days. There were only a few persons in the institute who did statistics, although there had been a very strong history in mathematical statistics, but there was not much applied statistics. However, in those days, there was a group at the Mathematical Institute that collaborated with others in Leiden and the Netherlands in a cluster, funded by NWO, called “NDNS+.” NDNS stands for “Nonlinear Dynamics of Natural Systems,” and the “+” mostly refers to statistics. At some point, the Dean of Faculty of Sciences asked me to come and join the institute for one day a week, sponsored by NDNS+. I thought this was incredible. And as soon as I was in the Mathematical Institute, I didn’t want to leave anymore.
At some point I moved completely to the Mathematical Institute. Making the transition from Data Theory in the Faculty of Social and Behavioral Sciences to the Mathematical Institute was not easy. As I mentioned earlier, one of the things we did in the Data Theory group was developing our own user-friendly software for data analysis, and at some point, we had the opportunity to develop a special package for SPSS. They bought the license for distribution, but Leiden University owned the intellectual property. This was in 1989, and I’ve been involved with that project ever since. For one day a week, I still meet with my SPSS group, and we work on new programs and include new developments in statistical methods in our software. The primary focus of the software is on nonlinear optimal scaling methods for dimensional reduction and prediction, and at some point, we’ve also included statistical learning methods. It really took some effort to move the royalties obtained from the license from one faculty to the other, but in the end, it happened in 2012. I started a whole new group at the Mathematical Institute. The statistics group, which first consisted mainly of Richard Gill and myself, expanded enormously, but that was also due to Aad van der Vaart, who was also appointed in 2012. He first received a senior ERC grant and later a Spinoza Award. So the statistics group started growing and growing, and nowadays this whole hallway is filled with statisticians. We also started a Master program in statistical science, of which I’m the director.3 We included a specialization in data science in 2017, and the master is really blooming. Lots of students can apply from all sorts of areas, so you don’t need a Bachelor in mathematics—but of course students need a sufficient background in statistics—and we have students from all over the world. Before our program, there was no other program in applied statistics in a Science Faculty in the Netherlands. 
There are several psychometrics groups which jointly do a similar program (in Utrecht), but that master program is in the Social and Behavioral Sciences. Could you say that you’ve grown a little apart from psychometrics? That you’ve now become more of an applied statistician? Who was it again who said that? It was Paul Tukey, in 1982, who said that I was a statistician. So I can hardly say that I’ve now become an (applied) statistician. I guess I have been one for about 40 years. Do you consider yourself a psychologist? As I said earlier, I’ve never considered myself a psychologist. At some point I became president of the Psychometric Society; I was elected in 2001 and was president in 2002–2003. I really enjoyed it, and it was a big honor. Since 1982, I had been going to the Psychometric Society meetings: almost every year to the Annual Meeting held in the USA, and always to the additional biennial meetings in Europe.
3 Jacqueline Meulman was director of the Master Statistical Science from 2009 to 2020. The name of the Master will be changed to Statistics and Data Science.
At first, all meetings were held in North America, and members of the Board of Trustees, who could sometimes be very conservative, said that there should always be an annual meeting in North America, because it was written in the bylaws. I thought that the need for a North American meeting every year with an additional meeting outside the USA every other year was not optimal, to put it mildly. When I was president during the 2003 Meeting in Cagliari, Italy, it was the first time there was not a second meeting in North America, and it has been that way ever since: one year there is a meeting somewhere in North America, and the other year there is a meeting somewhere else in the world. I think that has been very beneficial for the Society, and it may have been my biggest accomplishment as President. If you look at the Psychometric Society meetings, they have become much more diverse. First, the meetings outside North America were only in Europe, but in 2001 the Society held its first meeting in Japan. I was actually involved with the organization of that meeting, together with a group mostly associated with the Behaviormetric Society and Behaviormetrika in Japan. More diverse in terms of the countries that are participating? Yes, very much so. The meetings held outside North America attracted a completely new audience, not necessarily psychometricians, and so the topics also became much more diverse. But you said in the beginning of the interview that you thought the Psychometric Society had actually narrowed down. Yes, I feel that happened for a while, but, also due to the international meetings, it widened again. The meetings in North America are not as diverse as those in Europe, Japan, Hong Kong, and Chile, for example. I have big hopes for New York though; I predict that lots of people will join. Hopefully, there will be presentations on many different types of topics too. I think that’s a very positive development.
You consider it a positive development that the Psychometric Society is broadening again rather than focusing on the classical topics like IRT. Yes. One of the things I see (who am I to say it, but it worries me) is that the work of the “real psychometricians” is very technical, high-quality, and advanced, but not many people apply it in practice. There are even testing companies or agencies which use nothing that comes out of the psychometric community, and I think that’s very bad for the profession. On the other hand, it is maybe understandable, because, when you look at the articles in Psychometrika, they are of very high quality but are mostly technical. They have little to do with applied psychology or applied psychometrics. There are people in the Society that really want to bring psychometrics back to society and offer applications in schools and testing.
Would you encourage that movement? Yes! At some point, I was asked to be part of it. I was asked to start working for ACT Next. I thought about it, because I was very honored, of course, and it was a very nice proposal. I could do almost anything I wanted to do—start a research group, or live anywhere I wanted as long as it was in the USA. But I declined; it’s still not close enough to my area. Too test-focused. Yes. They said that they also wanted to do more statistical learning, so I ended up being a consultant rather than an employee. I like to bring people together, so when, for instance, Jim Ramsay really wanted to do something like a new way of testing and change the field, I brought him in touch with ACT Next. What do you think about the relationship between psychometrics and psychology? Should psychometrics help psychology become a stronger science, or are they more or less two separate fields doing their own thing? I’m afraid the latter is the case; over time, they have developed as two separate fields, and maybe that has always been the case. In the very beginning, psychometrics was very applied, but then it took off to be quite theoretical, and these days hardly any psychologist reads Psychometrika. I’m not exactly sure what’s happening now, but when I used to look at the methods that psychologists use, I became so annoyed: their methods were so traditional! Psychologists would still use methods like analysis of variance or t-tests. Good heavens! Their response then is that if they would use other methods, their research would not get published. Well, if you never try anything new, things will never change. I think the average psychologist cannot read Psychometrika. I think many psychologists would struggle with those articles. Psychometrika is one big mystery. For many applied statisticians, The Annals of Statistics is also too difficult to understand. 
The American Psychological Association started the journal Psychological Methods, and there are much more advanced methods in that journal. I know from experience, from work with one of my Ph.D. candidates who wanted to publish a paper there, that it took a long time to get the paper accepted because the article was much too difficult, too technical. I felt we had to sit on our knees to explain our research to psychologists. But I see now that it worked: psychologists read Psychological Methods, and I see many applications of the techniques from our paper.
There are of course psychologists who are interested in methods, and maybe they tend to read Psychological Methods rather than Psychometrika. That’s fine of course. I actually don’t think Psychometrika should change to be a psychology journal. I think it’s fine as it is, and they shouldn’t have to change the name. We have the same problem in statistics: when I was President of the VvS + OR, the Netherlands Society for Statistics and Operational Research, many people told me that we should change the name of the Society. They wanted to call it something like “Data Analytics.” That’s way cooler of course. I always thought that was ridiculous. Like calling Statistics Data Science instead. Our area is statistics, and we should be proud of it. It might be that the term “statistics” can put some people off, so be it. Of course, we can always join forces. I like the joint term “Statistics and Data Science.” But should something change? Psychometrika should be more open to other approaches from statistics. I think that if the quality is high enough, Psychometrika will be a good outlet for such papers, but the problem is that many other statisticians don’t read Psychometrika. It’s actually funny that there have been occasions in the history of psychometrics where statisticians reinvented methods that psychometricians had already written about. These statisticians didn’t know these inventions were ours, they published their own work, and now these methods are all attributed to the statisticians. That’s a pity, if you see that the credit is put somewhere else. If you don’t care about citations or credit, it matters less, but I think it is a problem. In those days you could say that you weren’t familiar with other people’s work— there was no Internet, no Google. In earlier days, the only way you could see what was published in a year was in the Statistical Index which listed all the published articles from statistical journals, including Psychometrika. 
But nowadays, you cannot really maintain you don’t know about other people’s work. But it still happens. There is an enormous citation bias. I remember a very famous article by Breiman and Friedman,4 which was about alternating conditional expectations (ACE). It was paper of the year in the Journal of the American Statistical Association, but that specific method had existed in the psychometric literature for quite some time since Forrest Young, Jan de Leeuw, and Yoshio Takane started working on alternating least squares methods.5 On the other hand, when Breiman and Friedman found out about the work in psychometrics, they
4 Breiman, L., & Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80, 580–598.
5 Young, F. W., De Leeuw, J., & Takane, Y. (1976). Regression with qualitative and quantitative variables: An alternating least squares method with optimal scaling features. Psychometrika, 41, 505–529.
did acknowledge it. It still doesn’t give Young, De Leeuw, and Takane the citations and the influence. ACE is still better known than MORALS or CANALS, but at least it’s now visible that psychometricians were first. Andreas Buja wrote a very nice paper a long time ago about these similar developments in the Annals of Statistics.6 I’m actually at the beginning of a new phase in my career, since my mandatory retirement is coming up. Luckily, I also have an appointment at the Department of Statistics at Stanford as adjunct professor. That does not mean “Assistant Professor,” as people sometimes erroneously think. It is a title meant for professors or professional statisticians who have a main appointment outside Stanford or have retired from another university. John Chambers, for example, is also an adjunct professor in the Stanford Statistics Department. I first visited Stanford in 2001 and have been officially associated with the department since 2009. I visit there a lot; I’m at Stanford a little over four months a year. A lot of out-of-office replies! Well, I usually forget to switch them on. It is great to be associated with the department, of course, a big honor. There are so many famous statisticians there, like Brad Efron, Trevor Hastie, Rob Tibshirani, and Jerome Friedman, just to mention a few members of the department with whom I interact most. I learn a lot from them, but on the other hand, they sometimes learn something from me. All in all, it has been a long path from being a psychology student to being a member of the Statistics Department at Stanford. A diverging path? No, I think, in retrospect, it seems like a converging path. You ended up where you wanted to be. Yes! I always tell everybody that I’ve never been happier, both in Leiden and at Stanford.
People sometimes ask me what I’m doing in a mathematical institute since I’m not a mathematician, and I always tell them that it’s the most wonderful institute in the world, except for Stanford Statistics maybe. The atmosphere is wonderful, and the people are giving each other the chance to do what they’re good at. They’ve given me all the opportunities to start new research, set up a new group, and do all the things I wanted to do. That really has been great.
6 Buja, A. (1990). Remarks on functional canonical variates, alternating least squares methods and ACE. Annals of Statistics, 18, 1032–1069.
When you look back at your career so far, what do you consider your most influential work? I guess the most influential work would be the multivariate analysis with optimal scaling and multidimensional scaling methods that are part of SPSS Statistics, which is now owned by IBM. Of course some methods were originally developed in the Department of Data Theory when I was at the start of my career, but I’ve taken the software development to a professional level, and over the last 20 years, the original methods have been replaced with completely new ones, containing many new ideas and statistical innovations. You may think that such software development is not very exciting, but there you are wrong. At the same time, the royalties generated by the project have created many opportunities in statistical research, for myself and my group. It’s a very big project; there are so many options in SPSS Statistics that include our programs, that we (as Leiden University) receive an enormous amount in royalties per year. That is a big deal for Leiden University, and they are very proud of it. So, the multidimensional scaling and multivariate analysis with optimal scaling tradition started in Leiden a long time ago, and in my more recent work, I combined it with developments in statistical learning. In my 1986 thesis, the focus was on distance-based methods which I stopped doing because it was computationally too intensive (optimal distance approximation between objects under optimal scaling/ transformation of variables), but it was quite innovative. However, I sometimes think I could pick that up again. Computers are so much faster and the data so different. Nowadays, data are often very high-dimensional, with many more variables than objects, and parallel programming is extremely feasible. Another example, I’ve worked on a distance-based clustering project, together with Jerry Friedman at Stanford. 
It resulted in a very influential paper which was published in the Journal of the Royal Statistical Society B with discussion.7 That paper is still an inspiration for many people who work in that area. One of my Ph.D. students, Maarten Kampert, picked the topic up again, and he made several new, exciting contributions in his thesis.8 And when you actually look at the history of psychometrics, what do you consider the most important achievement or contribution that psychometrics has made? Is there perhaps a certain article or book you thought was really great? I really should have thought about this before you came to me for this interview, but I think it’s very clear historically that factor analysis was one of the major
7 Friedman, J. H., & Meulman, J. J. (2004). Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 815–849.
8 Kampert, M. M. D. (2019). Improved strategies for distance based clustering of objects on subsets of attributes in high-dimensional data. http://hdl.handle.net/1887/74690. This monograph won the 2020 Psychometric Society Thesis Award.
contributions of psychometrics to the field of applied statistics. And extensions of factor analysis gave us structural equation modeling, which can be seen as a precursor of neural networks. Also, multidimensional scaling (MDS) has been reinvented every now and then, but the major contributions were really made by psychometricians. Modern MDS was invented in the 1960s; factor analysis is of course much older. Thinking along a completely different line, I think that the development of tests for the army in the USA in the First World War has been very important for the beginning of psychometrics. It is a history that we sometimes cannot be very proud of, if you look back at what was written at the time. But those developments were very influential and really made psychometric testing an important field. Some of the underlying philosophies, of some of the psychometricians involved, were not very politically correct, to put it mildly. Earlier, Francis Galton in the UK, one of the originators of correlation and regression, and inventor of one of the early mental tests, was a eugenicist. And eugenic ideas found fertile ground in the USA in the beginning of the twentieth century and had their influence on psychometrics, but not on all psychometricians. And of course, nobody in those days could have anticipated what happened many years later in the Second World War. I think making that connection would be a complete misinterpretation of the course of history. Is there someone, a specific psychometrician or statistician, who has really inspired you? For me personally, that would be Doug Carroll. He was the one who gave me my consultant position at Bell Labs; he invited me back, and he introduced me to so many famous people who became my friends. That was very unique. Other people that became my friends later on, like Larry Hubert and Phipps Arabie, were also very influential.
I collaborated with them very intensely in the 1990s, while I also had a position as adjunct professor at the University of Illinois in Champaign-Urbana. We wrote two books together.9 I would say the entire group at Bell Labs was very inspiring, but if I have to select one person, it would be Doug Carroll. He died not that long ago, and I still miss him. This is a little personal note, but I must say that it’s somewhat difficult for me to attend the Psychometric Society meetings. On the one hand, I’m very happy to see so many young people, to see that the area is blooming. But on the other hand, I miss my old friends. Of course, I’m getting older myself, and it’s unavoidable that my old friends are not coming anymore, but at one point I knew almost everybody, and now I only know a few. So nowadays, I usually go to the Joint Statistical Meetings (JSM), which are more in my area these days. And I try to go to meetings that combine statistics with data science. But I did go to the IMPS meetings in Beijing and Asheville. In Asheville, Larry Hubert and Willem Heiser organized a session about the most cited papers in Psychometrika,
9 Hubert, L. J., Arabie, P., & Meulman, J. J. (2001). Combinatorial data analysis: Optimization by dynamic programming. Philadelphia: SIAM. Hubert, L. J., Arabie, P., & Meulman, J. J. (2006). The structural representation of proximity matrices with MATLAB. Philadelphia: ASA-SIAM.
which was fun. I gave a presentation on the multidimensional scaling papers by Kruskal10 and Shepard.11 So, I’m still attending the IMPS meetings, after not attending for a couple of years. There are so many interesting meetings, and you have to make choices. What do you think is psychometrics’ biggest achievement? Well, as I said, factor analysis was a major research area which stimulated a whole tradition. Structural equation modeling follows out of factor analysis of course, and that’s still going strong. I would say that neural nets and deep learning have a precursor in structural equation modeling. Should psychometrics become more oriented towards those new movements, such as neural nets? I would say so. Of course, you always have to find your own niche in those fields. Historically speaking, psychometricians look at things somewhat differently than someone from machine learning. I never wanted to be a psychologist, but having studied psychology, I did learn about psychological research, how difficult it is to collect data, and how to do data analysis. These days, data are considered very important, but if you look back at when I was a student, lots of statisticians in academia had never seen real data, and were not using computers, believe it or not. In psychology, I already learned to work on a computer terminal in 1972: an IBM typewriter that was connected to an IBM mainframe. The software system had been developed for the statistics courses in psychology. Very advanced; I’m not sure such technology was available at all the universities in the Netherlands, but we certainly had it in Leiden. I think that was certainly influenced by developments in the USA. I never regretted starting out in psychology in Leiden, learning to work with data and computers, and winding up at the Department of Data Theory, because that gave me some kind of head start. What is psychometrics’ biggest challenge for the future? What is a problem that we haven’t solved? 
I think psychometric testing should really try to be more associated with real agencies in society that deal with testing in schools. It’s not my area, and I don’t want to be involved myself, but there are so many knowledgeable people in psychometrics, and not using their work is a waste of talent. As a parallel, in data science, we statisticians are saying that we shouldn’t leave data science only to the computer
10 Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27. Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29, 115–129.
11 Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. II. Psychometrika, 27, 125–140; 219–246.
scientists because it’s our area. We, the statisticians, care about data analysis, explaining the separation of the signal from the noise. Perhaps it is the same for psychometricians: they should say, “psychometrics is our area, and testing originated from us, and we should claim it back.” I’m amazed by the things I see on the Internet—major agencies that do testing and have no clue what psychometrics is all about. I think encouraging the test industry to use modern psychometrics is a real challenge. We talked about it when I was in the Board of Trustees, but I haven’t seen much movement yet. On the other hand, I haven’t really been following what’s going on if I’m honest! Perhaps I’m too pessimistic. What are your own plans? Do you have anything you’re working on at the moment? I’m still really eager to develop new methods for the interesting data that we are seeing nowadays. I don’t believe in the analysis of big data as such. I really think that’s a hype, and we statisticians realize that it is not necessary to analyze all the available data all at the same time. What we are good at is sampling, we know sampling theory, and we know that it is wise to take samples out of very big data, so we can do much more careful analyses, getting lots of benefits like variation diagnostics at the same time. At least, I adhere to the vision that we in statistics should not go with these people who are implying, “my data are bigger than your data!” Your data are 10 TB or even 10 PB? Well, I would say, good luck! Big data are usually very noisy data. And mostly, they are not representative of the population, and that is something we should always realize. I think we should concentrate on dealing with data in a much more sensible way. Naturally, you cannot ignore a field like data science, so together with colleagues in computer science, the Leiden Centre of Data Science was started in 2014, of which I’m still co-director. 
There are now many students in data science projects, and I enjoy that, but I would say that we shouldn’t give up our carefully built-up practices in statistics to careless massive computation just because we are able to do so.
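The sampling argument above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the "big" data set is simulated, the function name `sample_mean_with_se` and all sizes and seeds are hypothetical, and the standard error ignores the finite-population correction. It shows that a simple random sample of a few thousand records already estimates the full-data mean closely, and reports its own uncertainty at the same time.

```python
import random
import statistics


def sample_mean_with_se(data, n, seed=0):
    """Estimate the mean of `data` from a simple random sample of size n.

    Returns the sample mean and its estimated standard error
    (no finite-population correction is applied).
    """
    rng = random.Random(seed)
    sample = rng.sample(data, n)  # sampling without replacement
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return mean, se


# Simulated stand-in for a large data set (an assumption, not real data).
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]

full_mean = statistics.fmean(population)
estimate, se = sample_mean_with_se(population, 2_000, seed=1)

print(f"full-data mean: {full_mean:.2f}")
print(f"sample estimate: {estimate:.2f} (standard error {se:.2f})")
```

Here one percent of the records is used, and the standard error tells the analyst how far the estimate is likely to be from the full-data value. The sketch does not address the separate concern raised above, that big data are often not representative of the population; no amount of sampling from a biased source repairs that.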
Chapter 15
Willem Heiser
“I consider myself an amateur psychologist, but my interest is in psychology, that’s certainly true.”
Willem Heiser, emeritus professor of data theory at Leiden University, was president of the Psychometric Society in 2003. Willem Heiser finished his Ph.D. under the supervision of John van de Geer and Jan de Leeuw in 1981. His expertise lies in multivariate analysis, multidimensional scaling, and methods for classification. He also has an interest in the history of psychometrics.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_15
Welcome, Willem Heiser, thank you for your participation in this oral history project. In this oral history project, we’re interviewing psychometricians, past presidents of the Psychometric Society, on three main topics. One is your career as a psychometrician, the relation between psychometrics and other fields, and ultimately the history and future of psychometrics. So I will start with the first question: how did you end up in psychometrics? When I went to Leiden, I wanted to become a medical doctor, so I started out as a medical student. I finished my first year, this was in 1967, and the strange thing was: if you did medical school in that time, the whole first year you didn’t see one body. You had physics, chemistry, biology, and statistics, but it was only in the second year that “the body” came into the classroom, and believe this or not, but I wasn’t expecting that. Meanwhile, I had also become active in the student movement, and I was interested in changing the university and changing the way studies were done. A good friend of mine at the time said, “why don’t you stop with the medical thing, why don’t you come to psychology”; it’s also about people and doesn’t involve a body. No blood! No blood! And I decided to do that. There was a lot of action at the time in the psychology departments, in 1968. So I actually became a student activist in the psychology department. Then in the course of doing that, there were certain negotiations about how the students would influence their programs, and I met John van de Geer, the guy who was at that time the dean of our faculty; he had to negotiate with the students. He made a big impression on me; he was very factual and wanted to make arrangements, and he wanted everyone to be happy with the new arrangements. In the old university, the full professors had all the power. 
After 2 years of psychology, I became interested in quantitative psychology, because of Van de Geer, because of the friends that I had, and also I think because of an experiment we had to do. And then one of the professors said: “you have a whole story about a mechanism about why people are doing what they are doing in your experiment, which I think is all speculation, but you did the analysis of variance alright so I’ll give you a good grade.” That was a sign for me that speculation was not appreciated, and statistics were appreciated when you wanted to be an experimental psychologist. That was also a reason for me to become more interested in quantitative psychology. Only later I learned that not only in experimental psychology but also in clinical psychology a lot of interesting things were going on. People were testing people, and you could do a lot of statistics there. That meant that around 1973, I fully specialized in psychometrics, and I finished the study in 1975. You were also fascinated by clinical psychology, and you still are. Did you take that with you as a psychometrician later on? Yes, sure. I am fundamentally convinced that psychology and psychometrics are about individual differences. Cognitive psychology does not study these individual differences so much, and now that we have brain studies, it is also not much about
individual differences, and it will become hard for them to study the brain as an individual difference-generating thing. I believe that the kernel of psychology really is the differences between people, because they are so very big, and they are also influenced by so many factors, environmental factors, genetic factors, and social factors, so it is by far the most interesting part and the most central part of psychology. Has that become your mission as a researcher? In a sense it has. At a certain point I specialized in multidimensional scaling, which in principle is something that one could do with data without any individual differences. In traditional MDS data, you usually average across people. But there is a certain type of MDS, which is called unfolding, where you have individual differences that concern preferences of people across a number of objects, and it models the individual differences between people in terms of their preference for certain stimuli. Can you give an example of a study? Yes. An example is, for instance, a study about “smell” stimuli. You offer people different types of smell. So maybe flowers, or perfume, or something that smells bad. And you ask them to order the smells in order of preference. Then it turns out that there are considerable differences across gender—of course gender always generates individual differences—but also age, and you can do unfolding on those types of data. Traditionally, political preferences also generate a lot of individual differences. There is a whole area, called sensometrics, where the smell example is from, and they do a lot more than only that. Was MDS the main direction in your career? Certainly during the first 15 years. It was also the topic of the dissertation I wrote in 1981.1 In the 1980s I was also involved in the Gifi group, which was a group of researchers from Leiden. 
They were active in multivariate analysis in general, combined with optimal scaling, which means that you try to rescale the data or transform the data in such a way that the model fits even better than without transformation. In this Gifi period, I got more interested in more general multivariate data analysis. You also had a second supervisor, right? One was Van de Geer, whom I already mentioned, and the other was Jan de Leeuw. Jan de Leeuw was more the technical supervisor, and I learned a lot from him.
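As an aside on the unfolding model described in the answers above, the following Python sketch shows the forward direction of the idea only: each person is represented by an "ideal point" in the same space as the stimuli, and a person prefers the stimuli that lie closest to that point. The coordinates, the stimulus names, and the function `preference_order` are all hypothetical and chosen for illustration. Actual unfolding works the other way around, estimating such coordinates from observed preference rankings, which is the hard part and is not attempted here.

```python
import math


def preference_order(ideal_point, objects):
    """Rank object names from most to least preferred.

    Under the ideal-point model, preference decreases with the
    Euclidean distance between the person's ideal point and the object.
    """
    return sorted(objects, key=lambda name: math.dist(ideal_point, objects[name]))


# Hypothetical two-dimensional configuration of smell stimuli.
objects = {
    "rose": (1.0, 1.0),
    "citrus": (-1.0, 1.0),
    "musk": (1.0, -1.0),
    "smoke": (-1.0, -1.0),
}

# Hypothetical ideal points for two people.
persons = {"A": (0.8, 0.5), "B": (-0.9, -0.2)}

for label, ideal in persons.items():
    print(label, preference_order(ideal, objects))
```

Two people with different ideal points produce different rankings of the same stimuli, which is how the model represents individual differences in preference.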
1 Heiser, W. J. (1981). Unfolding analysis of proximity data. Dissertation. Leiden, The Netherlands: Leiden University.
Did Van de Geer bring in the psychology side of things? Yes, but he was also the boss. He was an example of a very interesting person, and I got a lot of inspiration from him. Then you continued with research on multivariate analysis, first MDS and then the GIFI book.2 Gifi is the name of an actual person. Do you know the story? Albert Gifi was the servant of Francis Galton. He actually served Galton for I think more than 20 years, in all kinds of things privately, but also with manuscripts and so forth. But when Galton died, it turned out that he gave most of his money to the eugenics stuff and only a few hundred pounds to Gifi, so Gifi was seriously underappreciated; that’s why we chose him. But it also refers to Galton and to the fact that multivariate analysis, factor analysis, and correlation, all these concepts, started out with Galton. He was a major inspiration? For me not so much, but for the group he was. Who was a major inspiration for you, apart from your supervisors? After my dissertation Doug Carroll invited me for a postdoc at Bell Labs in 1982. I had met him before, but him inviting me at Bell Labs was one of the highlights of my career, and I learned very much there. I also met Joe Kruskal at Bell Labs, and he is one of the heroes of multidimensional scaling. The third one was Clyde Coombs; he wrote a book in 1964, A Theory of Data,3 and our department at the time was called “data theory,” which was clearly inspired by Coombs. Van de Geer met Coombs for the first time in the early 1950s. John van de Geer’s idea to make a department of data theory was inspired by Coombs too. The book by Coombs was all about ordinal data, what to do with ordinal data, and categorical data, and that has been an important influence, while the influence from Kruskal was more technical. Kruskal showed how to do least squares in MDS and how to combine a model: MDS is a distance model, with transformations of the data in the same function. 
And that is an important ingredient of most of my work. I mentioned Doug Carroll, not only because he invited me but also because I learned a lot from him about the way science is done, that science is a social thing. I didn’t know that before: I thought science is what you do in your little room behind your computer. What did Doug Carroll mean with “science being a social thing”? He had an enormous network and corresponded with a lot of people. He took us out to conferences, he introduced me to people from the States, and he’d say: “This is
2 Albert Gifi (1990). Nonlinear Multivariate Analysis (Eds. W. J. Heiser, J. J. Meulman, G. van der Berg). New York: Wiley.
3 Coombs, C. H. (1964). A theory of data. New York: Wiley.
15 Willem Heiser
203
Willem from the Netherlands, he wrote a very nice dissertation.” He introduced me to refereeing, that there is a review process when you publish a paper, and how you can do that. He had a very big file of papers, and he also showed me how to keep track of the literature. Van de Geer or De Leeuw never said much about it. De Leeuw said, for instance, “you don’t read books, you scan them.” That was a typical “de Leeuw remark.” Why was that typical for De Leeuw? He likes to be flippant and a bit sarcastic. Have you ever received any criticism on your work? If you submit papers, you get criticism. Was there a serious theory against yours? Not specifically, it is a relatively small specialization. There were people like Wayne DeSarbo, who also worked on unfolding, and he published a lot, and he did a lot of work also of good quality, so he was kind of competition, but not criticizing what we did. He wrote a lot of papers that I would like to have written myself.4 So that was more or less healthy competition? Yes, healthy competition, that’s a nice word. Looking back now, have you ever had a moment when you questioned your own research career? Yes, of course. I wrote my dissertation in 1981, and we’re now 35 years later, and I still didn’t solve the unfolding problem; I consider unfolding still an unfinished problem. So, how to do unfolding? How to do unfolding in the best way, nonmetric unfolding—I also consider it as a topic that is not sufficiently appreciated by the outside world. We wrote a good paper in 20055 which was sort of solving the problem, and our program for unfolding is distributed at a major software company, so I cannot complain about that. But still it is true that there are only limited areas such as sensometrics and marketing where unfolding is done, and I’d like to see this improve.
4 DeSarbo, W. S., & Carroll, J. D. (1985). Three-way metric unfolding via alternating weighted least squares. Psychometrika, 50, 275–300.
5 Busing, F. M. T. A., Groenen, P. J. F., & Heiser, W. J. (2005). Avoiding degeneracy in multidimensional unfolding by penalizing on the coefficient of variation. Psychometrika, 70, 71–98.
Do you feel that there are fields in psychology that are overlooking the unfolding method? Currently, I have a project where we look for an unfolding-type model for an IRT situation. I should explain this maybe. The traditional unfolding model is person oriented, which means that it gives a preference across a number of objects, and if you see this in a matrix, there are persons in the rows and objects in the columns. Original unfolding is a row-conditional model. Each person has a certain preference function, and you build a model on that. But there is also an area where the items, so the columns of the table, are considered, which happens in IRT, and in traditional IRT the response curves are monotonically increasing, like in the Rasch model or in the two-parameter logistic model. There is a small subarea in IRT of people who say “there can be certain materials where it is useful to do a single peaked response function.” So if a person moves along the continuum, there will be a certain optimum for which object is maximally preferred, and then the preference goes down again. This is a subarea, and I consider it a subarea of IRT. That’s even more a niche. Sounds like one! That’s why I’m looking for some more action. Fair enough. A lack of interest in single-peaked item responses also has to do with the way that in psychology item analysis is done. Because if you have single-peaked items and you do a classical test theory thing with your items, then the single-peaked items will not correlate very well with the total score, so they’re usually thrown out. I still have a whole program for the new future to change that. We’ve already talked about this a little bit, but it sounds like you’re a psychometrician that still cares a lot about psychology; do you consider yourself a psychologist still? Some of my best friends consider me a psychologist. “You’re more a psychologist than a statistician.” But my work is rather technical, so, not psychological at all. 
I consider myself an amateur psychologist, but it’s certainly true that I have an interest in psychology. You want to use your technical knowledge and apply it to psychology. Yes. I have a rather extreme idea of psychology. I think that people are statistical machines that operate in the environment. I know certain ways that I can combine my interest in statistics and psychology by thinking about a person who is moving around, who has to take decisions, who has to take actions, or maybe who has to fight or flight—a lot of that is driven by statistical mechanisms. That’s one of my favorite ways of looking at psychology.
Have you ever been able to do something with that perspective? No, but I still have 30 years! I read that you’re a professor in psychology, statistics, and data theory. That was until my retirement 2 years ago, and that was because that was an appointment in the department of psychology. They wanted to make the appointments always inside psychology plus something extra. The other psychologists, the clinical and the cognitive, and the social psychologists just considered me as the quantitative guy. I don’t think they considered me a psychologist so much. When you compare yourself to other psychometricians in general, do you feel you relate more closely to psychology than other psychometricians? On average? In my work certainly not, because my work is rather technical. There is also this distinction that there are people like people from Leuven or from Amsterdam, but also in the United States, who are interested more in mathematical psychology type of approach, where you model a psychological process, which is also not my interest. So my interest in psychology is sort of a private thing, I like to read psychology books at home. I like to think about it, I like to relate to other people here in the department, talking about psychology with them, but my work is mostly technical or synthetic. I try to combine things. What do you think should be the relation between psychometrics and psychology? Are they two distinct fields? Well, I have a metaphor of a big river system. A river system starts with small little rivers, and that’s where I see various disciplines, like biology, psychology, economy, econometrics, and chemistry. Those are the areas where people do quantitative things. Sometimes they invent something for themselves which is also useful for others, and then these techniques which are invented in a substantive area go down the stream to the big river. The big river represents statistics, so to speak. That’s where everything ends up. 
And the small rivers represent areas like biometrics, psychometrics, chemometrics, and cliometrics, what have you. In that metaphor psychology is one of those areas where quantitative research is done, things with test scores, and nowadays they have of course fMRI data or diary data, when people walk around and fill out answers to questions on their iPhones. If there is a quantitative problem that has to be solved, I’d say psychometrics is the first group of people to look to because they have affinity with psychology, they know the type of data psychologists have. But what you see happening in fact, and an example is fMRI, is that an fMRI machine comes with a whole bag of statistics already in it, made by engineers, made by statisticians, who knows by whom exactly. So there is a whole technology coming in that already includes statistics. And psychometricians are not involved in that, so to speak, which is a pity. The interventions in the experiments
that psychologists do are still psychological interventions, and the outcome variables, apart from the brain data that are measured, are typically psychological outcome variables. I think both for psychology and for psychometrics, we should take care that we take part of the action here, in the innovation. Psychometrics and psychology should be more active in seeking each other out. Yes, that’s one of the challenges of psychometrics, not to keep looking at test scores, ordinary questionnaires, and so forth, where the whole IRT business is going on, but also to look at new data. Another important example of new data is diaries or longitudinal data for a single person. I consider that also as an important development that we have to catch up with. Of course there are people like Peter Molenaar who are already doing this. And what do you think is one of psychometrics’ main achievements? Principal component analysis, not factor analysis! Factor analysis and principal component analysis—those are the typical ones. But I don’t necessarily believe in latent variables. That’s typically something in psychometrics that is considered important, but that’s very hard to export. Outside of psychometrics there are not many statisticians that believe latent variables are important. And I believe that this is due to the fact that these models, like factor analysis or IRT models, have parameters that are estimated. Of course every statistician knows that that’s one of the main things in statistics, to estimate parameters, so there’s a common ground. But you don’t need to consider, for instance, an IRT model as a latent variable model; you can also just consider it as a model for a rectangular table where you have rows and columns, like in the analysis of variance, two-way analysis of variance, when your outcome variable is coded 0-1. 
Then you have this feature that it is the same person who answered various items; so there is a correlation, which creates the problem that the observations are not independent. The solution for these problems is classical, and I don’t believe latent factors, or latent variables, are the most important thing, but I do believe that the marriage between, let’s say, the factor analysis tradition, latent variable tradition, and the path model tradition which we have in structural equation modeling is a typically important contribution of psychometrics. What made that development happen? And what were the consequences of that invention? The first Psychometric Society meeting that I had outside the Netherlands, in 1978, was when Karl Jöreskog was the president, and he gave his presidential address6 which was a general talk about structural equation modeling. I must say that that address made quite an impression on me, although it was not my specialization.
6 Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443–477.
Karl Jöreskog’s advisor was Wold, Herman Wold, and he was a very good econometrician. He was the father of partial least squares, where you have a path model like in structural equation modeling, but no latent variables, no indicators, and Wold used to fit this model by least squares. Van de Geer was a friend of Wold, and when I was a young student, Wold came to Leiden, and I could talk to him, and this was also for me an interesting experience. Wold was not very convinced in the beginning that Jöreskog had a good idea. Jöreskog came to Wold, and he said, “I have this idea and I’m going to do maximum likelihood instead of least squares,” and it seems that Wold then had said: “yes of course you could try, why don’t you, but I don’t think you can always assume that all variables are normally distributed and so forth.” Ever since, students of Jöreskog have introduced a lot of generalizations: we now have structural equation models for almost anything. That development took for the largest part place in psychometrics, by people that were connected to the psychometric community. We know that factor analysis can be considered a special case. So I think this is a big unifying theme for psychometricians. IRT is a special case of structural equation modeling; maybe even MDS is a special case! It’s interesting: earlier you said that you’d rather see PCA as psychometrics’ biggest achievement, not SEM. Yes, to stay in the observed space! That’s your personal preference. That’s my personal preference. Not to draw any theoretical conclusions that you cannot be sure of. The observed space is also easier to communicate to people outside psychometrics. As soon as you start talking about latent variables, you see the eyebrows going up. The statistician’s eyebrows. Yes, the normal statistician’s eyebrows. I’ve worked with biostatisticians. 
I’ve taught a course on statistical consulting, together with a biostatistician, and the students had to talk to the clients, who mostly worked at the hospital. So they had medical analysis problems, and a lot of these problems also involved several questions that were asked in certain patient interviews. But the biostatisticians never came with the answer “let’s do IRT analysis on this.” No, he said, “let’s do a generalized linear model, or a mixed model, on it.” So, within the community of biostatisticians, the same analyses are done, just as the special case of the generalized linear model, which is an important overarching concept of course. And you don’t need to know about latent variables to do fitting there.
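The remark that the same analysis can be done as a special case of the generalized linear model can be written out for the Rasch model mentioned earlier. This is a minimal sketch in notation that is assumed here rather than taken from the interview: X_pi is the 0-1 response of person p to item i, theta_p is a row (person) effect, and beta_i is a column (item) effect.

```latex
\[
\operatorname{logit} \Pr(X_{pi} = 1)
  = \log \frac{\Pr(X_{pi} = 1)}{1 - \Pr(X_{pi} = 1)}
  = \theta_p - \beta_i ,
\qquad p = 1, \dots, N, \quad i = 1, \dots, K .
\]
```

Read this way, theta_p and beta_i are row and column main effects on the logit scale, as in a two-way layout without interaction. Treating theta_p as a random effect turns this into a logistic mixed model, which also accounts for the dependence among answers given by the same person.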
What role has psychometrics played in society? What are its contributions to society? Well of course, educational testing, which is big in the United States and in the Netherlands, though not so big in other countries, but that’s a clear example. I see a trend or tendency in other areas in psychology, such as social psychology and clinical psychology, that they start doing less testing. Clinical psychologists still do testing, but there is a big trend that psychologists want to do experiments. The tests should be quick. They don’t want to do a whole battery of tests, so they start doing very short tests, consisting of a few items, and keep relying on classical test theory, so there the role of psychometrics is kind of decreasing. So what we have to do is to show that there are other areas of interest to behavioral scientists where we do have contribution to make, like the ones I mentioned before. One other thing I want to mention is that of all scientists that have an influence in society, psychologists tend to be too introverted, too modest. They like to remain in the corner. This is sort of multiplied if you’re also a psychometrician, because then your language is so difficult. Speaking for myself, I’ve had a couple of PhD students who work on problems in mathematics education, how children multiply and do division problems. Those PhD students, like Marian Hickendorff, have influence in society, because there is, at least in the Netherlands, an active discussion about the best way to teach mathematics in schools. And in Amsterdam, of course, there is the Rekentuin,7 which I also admire. It’s a great project, where there’s an interactive system getting data out of mathematics problems for children. When you think of modern ways to do teaching, there’s also an avenue for influence of psychometrics. I’m glad there is an avenue! I want to continue with one of your personal hobbies, the history of psychometrics. 
First, I want to talk about your own contribution to the history of psychometrics. So I was just going to start with the question: what is your most cited paper? That’s a technical paper with Jan de Leeuw, about multidimensional scaling in 1980; I think it’s called something like “Multidimensional scaling with restrictions on the configuration,”8 and I’m happy that it’s cited a lot. It was more a Jan de Leeuw paper; I was at the time the younger, or less experienced, of the two. But it laid the foundation for other papers that followed, so it was also influential in my own work. Talking about history, when I started doing this, maybe 15 years ago, I gave a presentation on the history of psychometrics, and I received a lot of response; I received more response from the historical presentation than I have ever had from a technical paper. That was encouraging. I think I wrote a paper for a Japanese conference about the history of distance measurement.9 That paper starts with the Greeks, who tried to measure the distance from the earth to the moon and who tried to measure the circumference of the earth by very simple means and by applying geometry; I found that particularly interesting to see. Distance measurement is one of my interests because of multidimensional scaling, which is also a distance model. It turned out that part of this history happened in Holland, because in the seventeenth century, Snellius, a guy from Leiden, was involved in distance measurement across reasonably large distances; let’s say distances between the cities of Holland, so between Amsterdam, Alkmaar, Haarlem, Leiden, and Middelburg. And you’d do that by measuring angles. So Snellius was here in Leiden, standing at his own house, looking at St. Peter’s Church here in Leiden and another church in Zoetermeer, a close-by town, and he’d do angle measurements from different points. When you have all these angle measurements, you can form little triangles, and with a little geometry, you can measure the sides of the triangles, up to a constant, using sines and cosines and so forth. He actually started with a very small triangle within Leiden, which he could measure by just traditional small distance measurement, and then he could do this for the whole system. Snellius did this in the seventeenth century and made a map of Holland, based on this type of measurement, and it was the first really good map in terms of giving the right distances between all these places. This whole idea in geometry was done in France and later in Japan and all over the world, all the way to the Pacific, in typically the same way. And then in the early nineteenth century, Gauss was interested in the same type of geodesic measurement.
7 De Rekentuin, or Math Garden, is a computer-adaptive testing program for mathematics skills in school children (www.mathsgarden.com).
8 De Leeuw, J., & Heiser, W. J. (1980). Multidimensional scaling with restrictions on the configuration. In: L. Kanal & P. R. Krishnaiah (Eds.), Multivariate Analyses, Vol V. Amsterdam: North-Holland, 501–522.
In fact his theory about least squares, which he’s famous for, was not about a straight line: the example he used in his paper was not about a straight line.10 We immediately think least squares, straight line, or regression, but that was not invented yet (it was Galton who invented regression). Gauss encountered the problem of triangulation. He gives an example of least squares based on the triangulation of Friesland and Groningen up in the north of the Netherlands, and that is the first least squares solution to a triangulation problem. Thirty years ago, someone compared the Gauss least squares solution with the Snellius solution, and then it turned out it was only 5% better than Snellius had done in terms of precision. So, you can see, the first ideas about distance measurement were already very good, and everything that followed was more sophisticated, more precise, less standard errors, but you know, already a bit marginal! That’s what I like about studying the history of science—that some ideas started rough, but then at the same time, the first cut is the deepest. That’s what I like about it.
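The triangulation described above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not a reconstruction of what Snellius or Gauss actually computed: the function names `adjust_angles` and `solve_triangle` are hypothetical, the triangle is treated as planar, the three angle measurements are assumed equally precise, and the numbers in the example are made up.

```python
import math


def adjust_angles(angles_deg):
    """Least-squares adjustment of three measured angles of a plane triangle.

    With equally precise measurements, minimizing the sum of squared
    corrections subject to the angles summing to 180 degrees spreads the
    misclosure evenly over the three angles.
    """
    if len(angles_deg) != 3:
        raise ValueError("expected exactly three angles")
    misclosure = 180.0 - sum(angles_deg)
    return [a + misclosure / 3.0 for a in angles_deg]


def solve_triangle(baseline, angle_a_deg, angle_b_deg):
    """Law of sines: from side AB and the angles at A and B, return (BC, AC)."""
    angle_c_deg = 180.0 - angle_a_deg - angle_b_deg
    if min(angle_a_deg, angle_b_deg, angle_c_deg) <= 0:
        raise ValueError("angles do not form a triangle")
    scale = baseline / math.sin(math.radians(angle_c_deg))
    side_bc = scale * math.sin(math.radians(angle_a_deg))  # side opposite A
    side_ac = scale * math.sin(math.radians(angle_b_deg))  # side opposite B
    return side_bc, side_ac


if __name__ == "__main__":
    # Hypothetical measurements: a known baseline and three noisy angles.
    a, b, c = adjust_angles([59.9, 60.2, 60.2])
    bc, ac = solve_triangle(1000.0, a, b)
    # Either computed side can serve as the baseline of the next triangle.
    print(round(bc, 1), round(ac, 1))
```

The angles alone fix the shape of each triangle only up to a constant; the one directly measured baseline supplies the scale, and chaining triangles carries that scale across the whole network.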
9 Heiser, W. J. (2003). Early statistical modeling of latent quantities: The history of distance measurement by triangulation. In: H. Yanai, A. Okada, K. Shigemasu, Y. Kano, & J. J. Meulman (Eds.), New developments in psychometrics. Tokyo: Springer, 33–44.
10 Heiser, W. J. (2003). Early statistical modeling of latent quantities: The history of distance measurement by triangulation. In: H. Yanai et al. (Eds.), New developments in psychometrics. Tokyo: Springer, 33–44.
In the history of psychometrics, what is according to you the most influential book or article? I’d say the Cronbach’s alpha paper, in Psychometrika, in the early 1950s, 1951 I think.11 It is very heavily cited, over 25,000 citations, and the main reason, I think, is because it’s applied all over the place. It’s cited a lot in psychology of course, but also in medical science and all kinds of other areas; Cronbach was influential and still is. The strange thing about this paper is of course that the coefficient itself was already known; people had already worked on it: Guttman had worked on it, Richardson had worked on it, but the contribution was that Cronbach made it more understandable for a normal person. He gave it an interpretation which was clearer, gave it a name which was understandable and so forth. That helps if you want to get ideas across. We psychometricians have the tendency to be a technical kind of people and write everything in such a way that everything is correct, but sometimes we do not have the flair that Cronbach had. Maybe I shouldn’t say that Cronbach’s alpha was the most important idea; “important” has this connotation that it generates more ideas. As I mentioned, Jöreskog was also very important. I hate to say it, because it’s not my specialization. It is something that we can be proud of and extend beyond psychometrics, hopefully more so in the future. And in 50 years or so, what is your own legacy? Yes, well I don’t think it will be a paper that is already published! My best paper is still to come! I think I might be remembered because I was the editor of two journals, that type of thing might stay alive. Maybe the Gifi book might survive, which was our collective work on nonlinear multivariate analysis. I was one of the editors on that together with Jacqueline Meulman. But as I told you before, the definitive paper about unfolding is still in my head. It’s coming for sure. I won’t die before it’s finished.
11 Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
I’ll come back to you when that moment arrives! Are there scientific fields that you think psychometrics can learn from? Do you envy certain scientific fields? I envy biology. It is the century of biology, and everyone’s looking at biology; even psychologists are doing biology, which is stupid of course! But objectively speaking, biology is important. Intellectually, I’m not so sure. Intellectually, I think psychology is still inspiring; it’s an intellectual challenge. The century of psychology was announced early in 1913 by a Dutch psychologist, Gerard Heymans. He said the twentieth century will be the century of psychology, which
more or less came true, because now we have a lot of psychologists in the Netherlands, but I don’t think we managed to have such a big influence on society as we would have liked. Since psychology is still a challenge and there is still a lot of work to be done in that field, do you think it is also still an exciting field? Yes! I have nothing with physics, I don’t care about law! I do care a little bit about ethics. There are all kinds of disciplines I am interested in, but envy is a strong word. You’re happy where you are. Well, a difficult dilemma is the thing of specialization versus interdisciplinarity. There is a lot of talk about interdisciplinarity nowadays. But I don’t think interdisciplinarity can exist when you don’t have a discipline. And both psychology and statistics are already in my view very interdisciplinary. My story about the little rivers that come together in the big river shows that statistics is interdisciplinary. Psychology itself is also scattered: there are many different approaches and completely different terminologies even. So, we’re still busy finding the right level of specialization. So, I advise young people to specialize in something, so they can really write papers on the topic, and not only do interdisciplinary stuff because then they end up nowhere. Is that also a way of saying that psychometrics as a discipline should in that sense be saved, should be cherished? Yes! I consider the strongest point of psychometrics the interest in individual differences. If you look at it from the point of view of statistics, it means a certain type of models in statistics allows you to study individual differences. This means we always have an inroad in statistics, because this is one of our specializations, and individual differences are also important in biology, for instance, and even maybe in chemistry. 
So this whole idea of individual differences is one of our key notions, and then of course the second typical property is that we’re very good in categorical data, binary but also multiple categorical data, and ordinal data, which are not so prominent in statistics in general. Ordinal data, for instance, is a very little niche in statistics, while on the other hand in the behavioral sciences, categorical data and ordinal data are everywhere. We should certainly keep our key knowledge. And then I’ll come to my final question. I’ve heard several plans that you sort of still have for the future. That’s good to hear. First of all, your best paper ever… … will be written in the future, yes.
Is there anything else you’re currently working on? I also have a dream about writing another book, and the topic there will be something to do with chance. I think the way people think about chance and random phenomena is too limited. I also think there’s a psychological angle to chance. I think it’s a human attribution: it’s not a physical phenomenon, it’s something human. So that’s also a new project: to write a book, which gives a psychological angle to chance. In probability theory or in statistics, there’s a very limited notion of randomness, and there used to be a big discussion about the different types of probability you might have, like objective probability, subjective probability, and so forth. But that all became irrelevant now that we even have Bayesian statistics everywhere. The conceptual issues about probability and chance are still of interest to me, and I’ve never worked on that before. So that’s a project for the future. Thank you very much, Willem, for this interview. You’re welcome.
Chapter 16
Ulf Böckenholt
“We live in the age of big data, the age of self-quantification. I wear a Fitbit watch. Self-quantification is a dream come true for a psychometrician!” Ulf Böckenholt is the John D. Gray professor of marketing at the Kellogg School of Management at Northwestern University, Illinois. He was president of the Psychometric Society in 2005. Ulf Böckenholt earned his Ph.D. at the University of Chicago in 1985 under the supervision of R. Darrell Bock. His main research interest is the development of statistical and psychometric methods for judgment and decision-making.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_16
Thank you, Ulf, for participating in this project on the history of psychometrics. I’ll be asking questions about your career; about the relationships of psychometrics with other fields, such as psychology or statistics; and your view on the history and future of psychometrics. And I always start with the question: how did you end up in psychometrics? I ended up in psychometrics almost by serendipity, as it often happens in life. I was studying psychology in Germany and applied for a fellowship to study abroad. One of the requirements was that you had to specify where you wanted to study. I knew just a thing or two about the University of Chicago, thanks to Werner Wothke, an instructor of mine in Germany. He recommended it, so without much further thought, I chose the University of Chicago as a place of study. Fortunately, the fellowship was awarded to me, and a few years later I received my Ph.D. under the supervision of Darrell Bock. I was interested in psychometrics before I came to the University of Chicago, but it seemed out of reach to me to contribute to this field. But the training I received and my interactions with Darrell made me realize that my passion for psychometrics did not need to be latent. Was the German training in psychology very different from American psychology or your Ph.D. research? Yes, quite a bit, on several fronts. In Germany, I could not take many graduate-level classes because the emphasis was on undergraduate training. Also, many of the departments did not offer classes to students outside of their discipline: as a psychology student, it would have been very difficult for me to take classes in econometrics or sociometrics. And, finally, in my days there was little emphasis on making research an explicit part of the training. All of this was different and for the better at the University of Chicago. What about psychometrics appealed to you?
Looking back I think that already in my studies in Germany, I was mostly interested in formal approaches towards human behavior. Such topics as auditory psychophysics or models of human problem-solving intrigued me sufficiently that I even decided to study them on my own and to pursue them in my master thesis. I liked how formal methods allowed for rigorous tests and precise predictions and, importantly, the fact that I could program primitive versions in FORTRAN. But my true interest in psychometrics became only full-blown at the University of Chicago when I could see how useful psychometric methods were in capturing human differences. What inspired you so much at the University of Chicago? Perhaps because of the contrast with my studies in Germany, I have very fond memories of my stay at the University of Chicago. It was such a liberal place. I could steer my path in almost any way I wanted: aside from the fantastic training I received in the quantitative psychology program, I could take classes with Leo Goodman,
who was a leader in categorical data analysis in those days, and with Shelby Haberman, who is a very well-known statistician. Actually, at that point, Haberman had just completed his books on categorical data analysis.1 There were classes on missing data analysis with Don Rubin, on time-series analyses with George Tiao, on econometrics with Arnold Zellner, and so on—everything that I was interested in was available, and you’d learn it from the very best people in the field. Frankly, it was an awesome experience for me. What type of research did you do in that time? I spent a couple of years taking classes, and then at some point, I had to take a qualifying exam, which was part of the requirements of my program. I also needed to come up with a topic for my thesis, and I would say that I approached it in a very pragmatic way: I saw it just as a requirement I needed to pass, so I extended a topic that I encountered initially as part of a research project under the supervision of Darrell Bock, which was a market research project for Hershey’s. Hershey’s, a chocolate and candy company, was interested in creating a product that maximized preferences for its ingredient mix; so the question was how much chocolate, how much peanut butter, how much sugar, etc., and in which combination, should go into this product. This is actually an interesting psychophysical and statistical problem: there is a response surface that captures how preferences change as a function of the ingredients, and the goal is to find the optimum, the point of maximum liking. And you also want to describe the uncertainty in estimating this optimum, because that gives you a bit of flexibility: do I have to use exactly 10 grams, or can I vary between 5 and 15 grams and still get a statistically indistinguishable estimate for the maximum preference? 
This kind of statistical problem can be called an “inverse prediction problem”: you have a maximum, and for that maximum, you need to predict the mix of ingredients that lead to this maximum and then get uncertainty estimates for the ingredient combinations. That was my starting point, and then I found some other applications; all of these were so-called inverse prediction problems, where the goal is to predict what the values of the independent variables are that satisfy certain features, such as an optimum or a threshold. Initially, I started with the Hershey dilemma, and later on, I put the chocolate world aside and instead worked out the statistical properties of this issue. I extended that work in several chapters, looking at it from different angles, and that turned out to be my thesis.2
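The idea of locating an optimum and then describing its uncertainty can be sketched as follows. This is a simplified illustration and not the method of the thesis: it assumes a single ingredient, a quadratic response curve, made-up data, and a case-resampling bootstrap for the interval (a technique chosen here for brevity). All function names are hypothetical.

```python
import random


def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*z + b2*z**2 with z = x - mean(x).

    Returns (mean_x, b0, b1, b2). Centering keeps the normal equations
    well conditioned; they are solved with Cramer's rule.
    """
    n = len(x)
    m = sum(x) / n
    z = [xi - m for xi in x]
    s = [sum(zi ** k for zi in z) for k in range(5)]          # sums of z^k
    t = [sum((zi ** k) * yi for zi, yi in zip(z, y)) for k in range(3)]
    A = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(A)
    coefs = []
    for j in range(3):
        Aj = [row[:] for row in A]
        for i in range(3):
            Aj[i][j] = t[i]
        coefs.append(det3(Aj) / d)
    return (m, *coefs)


def optimum(x, y):
    """Ingredient level at which the fitted quadratic is maximal."""
    m, _, b1, b2 = fit_quadratic(x, y)
    if b2 >= 0:
        raise ValueError("fitted curve has no interior maximum")
    return m - b1 / (2 * b2)


def bootstrap_interval(x, y, n_boot=500, level=0.95, seed=0):
    """Rough percentile interval for the optimum, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)
        try:
            estimates.append(optimum([x[i] for i in idx], [y[i] for i in idx]))
        except (ValueError, ZeroDivisionError):
            continue  # skip degenerate resamples
    estimates.sort()
    lo = estimates[int((1 - level) / 2 * len(estimates))]
    hi = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lo, hi


if __name__ == "__main__":
    rng = random.Random(42)
    grams = [float(g) for g in range(21)]                      # hypothetical doses
    liking = [50 - 0.4 * (g - 10) ** 2 + rng.gauss(0, 2) for g in grams]
    print(optimum(grams, liking), bootstrap_interval(grams, liking))
```

A wide interval around the estimated optimum corresponds to the flexibility described above: any ingredient amount inside it gives a preference that the data cannot distinguish from the maximum.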
1 Haberman, S. J. (1978). Analysis of qualitative data. Volume 1: Introductory topics. New York: Academic Press. Haberman, S. J. (1979). Analysis of qualitative data. Volume 2: New developments. New York: Academic Press.
2 Böckenholt, U. (1989). Analyzing optima in the exploration of multiple response surfaces. Biometrics, 45, 1001–1008.
16 Ulf Böckenholt
To me, this doesn’t sound like very typical psychometrics; do you consider your thesis a psychometric thesis?
Maybe not mainstream psychometrics, no. These were applications of statistical theory that you can utilize in a psychometric and marketing context, but I wouldn’t call this topic mainstream psychometrics. And yet, it’s a study of preferences, and to me, the quantification of preferences belongs to the class of psychological measurements.
At Northwestern University, you’re working in the Marketing Department; has that become more your niche than psychometrics? Or do you consider it to be similar?
The link between psychometrics and marketing is quite strong. Understanding and predicting preferences, as well as drivers of individual preference differences, are major research topics in marketing. Going back to these initial years of mine studying psychometrics, there’s one major source of influence, which was a book by Darrell Bock and Lyle Jones about the measurement and prediction of judgment and choice3 that triggered my interest in marketing. It’s a fantastic book, but it’s not a book that many people know about. This book greatly affected my research interests. In the beginning, I worked extensively on comparative judgments and different methods to elicit comparative judgments. Comparative judgments are different from absolute judgments. One example of absolute judgments is ratings, which are a standard topic in IRT, but in comparative judgments, you’re asked to compare multiple objects with each other and pick which one you like more or which one is stronger, say, with respect to a specific attribute. Some psychometricians were attracted to comparative judgments, but mostly in the context of the unfolding model, which is a model that links preferences to attributes.
I was interested in the unfolding model, but I was also interested in extending the applicability of this model and other preferential models to a wider range of these comparative judgment data, so that’s what I worked on for some years. I developed models for paired comparison data, for ranking data, for partial ranking data, for partial incomplete ranking data, et cetera. And that’s also the theme of the book by Bock and Jones. I view my initial work as something that builds on their book and sometimes modernizes and extends it. Their book was full of excellent ideas, but, for example, the estimation technology was not fully developed in those days, so this was one part I could extend.
Apart from his book, what did Darrell Bock teach you in his time as your advisor?
Working on your thesis with an advisor like Darrell Bock is a humbling and forming experience. I remember, at the beginning, I understood only a fraction of his comments. But it actually inspired me and did not discourage me from pursuing whatever I was doing. Now I see that his patience and support must have played a huge role in my drive to learn psychometric methods. When he started talking about Hessian matrices, say, I would not be quite sure what this referred to … but because I wanted to understand his thoughts and comments, little by little I learned a lot on my own, so I started seeing where he was coming from. Naturally, he had a tremendous influence on my way of thinking about statistical modeling and how it should be done. I also learned in a structured way under his guidance: I took classes with him about item response models and estimation, and I learned from him how to program and estimate models, which was an invaluable experience.
3 Bock, R. D. & Jones, L. V. (1968). The measurement and prediction of judgement and choice. San Francisco: Holden-Day.
Do you consider your work to be very much influenced by him, or did you really choose your own path?
Darrell was not the kind of person to tell you what to do. So in that regard, I think I cannot say it was his fault! My research was really about my preferences and what I found interesting and appealing, but he certainly affected the way I thought about the world from a psychometric point of view. He also affected my research style. He was ahead in terms of programming; he developed a programming language called MATCAL,4 which allows you to operate with matrices, similar to what you can do with MATLAB today. He had already developed that in the 1960s, and as his students, we were allowed to use it. The logic he used to structure problems became naturally the foundation of how I think about programming. In many ways, like other great thinkers, he was ahead of his time. Both his vision and his persistence in implementing it left a clear mark on the way I think about research and therefore on my work.
Where did you go after your time as a Ph.D. student in Chicago?
My next destination was at the University of Illinois at Urbana-Champaign.
I was truly fortunate to join an amazing group of psychometricians including Lloyd Humphreys, Ledyard Tucker, Larry Hubert, Phipps Arabie, Stanley Wasserman, Larry Jones, and later David Budescu. This was my second “forming experience” so to speak, in terms of the breadth and depth of psychometric topics everyone was working on. To top it off, Larry Hubert was the editor of Psychometrika at that time, and he was extremely helpful with my papers and research.
I read on your resume that you also spent some time in Groningen?
My work at the University of Illinois Urbana-Champaign on comparative data and preference modeling evolved more and more in the direction of economics, so I took a position in the Economics Department at the Rijksuniversiteit Groningen. And probably, at this time, a part of me longed to live in Europe again.
4 Bock, R. D. & Repp, B. H. (1974). MATCAL: Double Precision Matrix Operations Subroutines for the IBM System/360–370 Computers. Ann Arbor, MI: National Educational Resources.
There are some statisticians or psychometricians in Groningen as well, but you affiliated more with the economists.
Yes and no; my position was in the economics department, but I also spent quite a bit of time talking and collaborating with the psychometric group from the Faculty of the Behavioral and Social Sciences: Marijtje van Duijn, Herbert Hoijtink, Henk Kiers, Ivo Molenaar, Tom Snijders, and Jos ten Berge, to name just a few members.
Did economists have a very different thinking style?
No, not at all. Just as an example, the dean of the Faculty of Economics at that time, Tom Wansbeek, had just published a book on psychometrics in econometrics.5 Actually, much of the work in those days on modeling choices was done by econometricians. One person who stands out is Daniel McFadden, who got the Nobel Prize for his work on the logit model.6 I was very familiar with this work because of Darrell’s early contributions to this field. Discrete choice models for revealed preferences are used in many diverse fields such as engineering, environmental management, urban planning, or transportation and not just in economics and marketing. This led to many model variations that moved away from the utility maximization paradigm, which appealed to me a lot.
So comparative judgment is one of your main research interests; did you also have other interests?
I got interested more generally in the analysis of frequency data, particularly count data, and I wrote several papers on how to analyze count data.7 They had never been looked at from a psychometric viewpoint, so I applied psychometric models to count data, including time-series data. More recently, I would say I’m actually back into modeling absolute judgments and the analysis of rating data. I focus on what I call “multiple process models.” These models do not assume that the ratings result from just one thought process, but postulate multiple cognitive processes instead.
For example, in educational testing, you may think that an observed score may be a reflection of a person’s ability or of lucky guessing, so two latent effects need to be separated: guessing and the respondent’s ability. In absolute ratings, especially in self-reports, there may be multiple response processes as well. For example, people may like to present themselves favorably, but they may also want to be truthful at the same time. So they need to manage those two processes when answering an item. In analyzing self-reports, you have to consider these potentially conflicting goals that may drive the judgments. I’m trying to come up with ways that allow figuring out which process is dominant in leading to a particular response. This will allow us to move away from the not-always-realistic assumption that self-reports are unbiased, candid, and accurate and improve the quality of information that can be extracted from self-reports.
5 Wansbeek, T. J. & Meijer, E. (2000). Measurement errors and latent variables in econometrics. Advanced Textbooks in Economics, 37. Amsterdam: North-Holland.
6 McFadden, D. P. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka (Ed.), Frontiers in Econometrics (pp. 105–142). New York: Academic Press.
7 Böckenholt, U. (1999). INAR(1) Poisson regression models: Analyzing heterogeneity and serial dependencies in repeated count data. Journal of Econometrics, 89, 317–338. Böckenholt, U. (1999). An INAR(1) negative multinomial regression model for longitudinal count data. Psychometrika, 64, 53–67. Böckenholt, U. (1999). Analyzing multiple emotions over time by autoregressive negative multinomial regression models. Journal of the American Statistical Association, 94, 757–765.
That sounds more related to psychometrics than maybe your earlier work.
You’re the judge!
Well, maybe you disagree with that.
To me, these are just labels. My work on frequency data and my work on comparative judgment data are about human judgments. The different data types result from the many ways that are available to elicit responses. Depending on the context, they can be frequencies, ratings, comparisons, and so on. Although the response modes may differ, the underlying traits or states may or may not depend on them. I think psychometric tools can be very helpful in finding this out.
When you studied psychology in Germany, you were also interested in cognitive psychology. Are there still traces of that in your work?
Absolutely! One strand of cognitive psychology is about dual-process models that assume that thoughts and behaviors are a result of multiple perhaps even competing psychological processes, and clearly, this is very interesting to me. That’s a source of inspiration when developing psychometric models.
Do you consider yourself a psychologist?
Yes, I do. I follow the literature in psychology, I publish in psychology journals, and much of my work is informed by psychological theories and findings.
So you’re not as focused on the marketing side as you are on psychology?
Marketing is very much an interdisciplinary discipline, with a very heavy psychology component.
If you look at theories of consumer behavior, I would say that mostly they are based on psychology. Other, perhaps less dominant, theories come from sociology, economics, and now also from neuroscience. In addition to the branch of consumer behavior, there is also quantitative marketing. Quantitative marketing is more influenced by economic thinking, like game theory and structural modeling, which are areas that I also find quite interesting.
At IMPS, I think much of the research that is presented there is IRT related or SEM related. Is that your experience as well? Do you feel like you only represent a small group in psychometrics?
These days, IRT has become quite cognitive: the diagnostic models are based on mixture ideas, but they also try to link test responses to cognitive factors, to knowledge representations, or to skill levels. I love attending IMPS meetings; they are always stimulating and fun. And the breadth of topics is also not narrow because the meeting continues to attract people from many different areas. This is particularly apparent for the international meetings.
Do you think that psychometrics should be more than test-related research?
Yes, I do. I think there are many opportunities where psychometric theory is applicable and can play a vital role. This can be seen from the many psychometric topics that have reemerged in publications on meta-science.
Do you think that will happen in the future?
Yes, I am very optimistic. It will happen sooner if we can increase the number of students that are being trained and the number of places all over the world where psychometrics is a major topic in research and industry applications.
Do you also work with other applied psychologists in your research?
Of course! I have several ongoing research projects with my behavioral colleagues at Kellogg. And, currently, I supervise a thesis on “price fairness”: when do people decide that a price for a product or service is fair as opposed to not fair? That has a strong psychological component to it, but there are also huge measurement issues that need to be tackled.
Some consider the worlds of psychology and psychometrics to be separate. Do you think that’s a problem?
I actually think we are moving in the right direction in that regard.
When I look back, psychometrics has had quite a significant influence on psychology, and nowadays I think this influence is less apparent, but I would not say this is a fault of either side. My view on this is that psychometricians develop tools, and those tools can influence and perhaps even be turned into theories. At some point, factor analysis was utilized as a theory in personality or motivational work, and psychologists also developed theories that influenced how psychometricians approach a problem, so I think the current work on diagnostic testing is certainly influenced by cognitive theories in psychology. What psychometricians could do perhaps more of is try to come up with tools that in some sense influence how psychologists theorize. So I mentioned factor analysis already, but another tool that I think was quite influential, though also quite specialized, is the so-called multinomial processing tree models.
They were developed by Bill Batchelder and David Riefer.8 It’s a simple application of categorical data analysis, somewhat specialized, but it had a notable influence on how researchers test memory representations and other cognitive theories. These connections show to me that the link between psychology and psychometrics works, but we just need to do more of it. I don’t see why psychologists would not value or utilize the work of psychometricians more. More generally, I think when tools are developed, it’s very meaningful to think about how that would influence a psychological theory. A new tool usually means you use data in a different way, which enables you to extract more information from data. The whole notion of individual differences, the basis of factor analysis, had a tremendous influence on how psychologists thought about personality theory. Another example that comes to mind is multilevel modeling. I wouldn’t call multilevel modeling a psychometric tool per se, but psychometricians were quite influential in the initial stages of multilevel models, and they were also influential in the theory-building of contextual effects, not only in psychology but also in sociology and other related areas. So these links between psychology and psychometrics are very clear, and in the case of factor analysis, they have worked for a long time, which probably makes it the most influential psychometric technique that has ever been developed.
You’re familiar with economics, you’re familiar with marketing; are these fields psychometrics could learn from? Do other fields have techniques that psychometrics could use?
Yes, of course. None of this is a secret, and psychometricians are well aware of it: for example, estimation techniques which are developed in other fields are frequently used. The field of Bayesian statistics or Bayesian thinking was embraced by many psychometricians.
I think that instead of thinking more about “what can I do with my current data,” it would be useful for psychometricians to think about how to get the data that you need to answer a particular question. As an example, I can of course postulate that when people answer an item, they have a different response strategy for each item, and they switch from one response strategy to another response strategy across items. I can try to fit a model that accommodates this particular process, which would be quite complicated. The issue is though that this model will run into identifiability problems: the data may not have the information to the degree that is needed by the model. And that tells you that richer data are required, additional data of different types. If I would criticize something about psychometric work, I would say that people think too much about the current types of data they have and they think less about other types of data that could nowadays be easily obtained, in which case they wouldn’t have to make these complicated modeling assumptions, but could identify the conjectured processes in a much more straightforward way. So, the notion of trying to squeeze data as much as possible to extract whatever one would like to extract is outdated, and instead one should become much more creative in utilizing the many other sources of data that one nowadays has access to.
8 Hu, X. & Batchelder, W. H. (1994). The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59, 21–47.
Can you name a couple of these sources?
We live in the age of big data, so we have textual data, video data, and digital data. Coming back to this problem that I posed earlier, “do people switch their response strategy across items?,” one way to make progress on that question is simply to record the eye movements in addition to the response. What do the respondents look at? Do they read the item? How much time did they spend looking at the previous item? How much time do they spend looking at the response scale? This is additional information that can be quite helpful in seeing whether there is a contextual effect where previous items affect current answers or whether people are only thinking about the current item and don’t quite know what to answer. With these additional data sources, it’s certainly easier to characterize what’s going on in a person’s mind.
Do you think that’s where psychometrics is headed?
That’s where it should be headed, in my view. We live in the age of big data, the age of self-quantification. I wear a Fitbit watch. Self-quantification is a dream come true for a psychometrician! If only we had a Fitbit for our minds!
I always ask some questions about the history of psychometrics. What do you think is your own legacy? What do you think is your most important work that you will be remembered for?
I let others decide about my legacy. Do I want to identify one piece that needs to be read? Not really, I think Google search has eliminated this problem. I know which of my papers are cited more heavily than others. My most cited paper9 has no psychometrics in it whatsoever; it’s about how physicians make medical decisions.10
Well, it has psychology in it!
Yes!
My second most cited paper at the moment is a tutorial on how to bootstrap,11 which was written for an applied journal, Psychophysiology. It has no theory in it that is new or original, but it illustrates the use of the bootstrap in areas that psychophysiologists find very important. It has the benefit of being the first paper on bootstrapping in this particular discipline, and non-methodologists find this paper helpful.
9 The statements about the most cited papers pertain to the date when this interview was held, which was in May 2017.
10 Weber, E. U., Böckenholt, U., Hilton, D. J., & Wallace, B. (1993). Determinants of diagnostic hypothesis generation: Effects of information, base rates, and experience. Journal of Experimental Psychology, 19, 1134–1150.
11 Wasserman, S., & Böckenholt, U. (1989). Bootstrapping: Applications to psychophysiology. Psychophysiology, 26, 208–221.
Who do you consider the most important psychometrician?
Psychometrics is quite lucky in the sense that it had multiple people at the beginning with rather different strands of work. There is Francis Galton, who coined the term psychometrics; Charles Spearman, who designed factor analysis; Harold Hotelling, who developed canonical correlation analysis and multivariate statistics; Gustav Fechner; Sewall Wright; and a long list of people from different disciplines. Their foundational work influenced multiple generations of psychometricians, so I don’t think there’s a single parent. This combination of multiple topics from different disciplines proved to be very powerful for psychometrics and the development of psychometric tools.
Whose work has influenced your work the most?
My thesis supervisor Darrell Bock. I think he has influenced my thinking the most. Next are Larry Hubert, Larry Jones, and Phipps Arabie, who were my colleagues at the University of Illinois. That was a very productive time. Later I got to know Jim Ramsay and Yoshio Takane well when I worked at McGill University, both of whom I admire greatly, because of their creativity and ingenuity. These are the people I would list as my key influences.
And what do you think is psychometrics’ biggest achievement?
I think the biggest achievement is that psychometrics took the notion of a latent variable seriously and went with it. It also proved to be a huge topic for many researchers, with many interesting products.
Factor analysis, structural equation modeling, item response modeling, missing data, meta-analyses, multi-level modeling, cluster analysis, and latent class analysis: in one way or another, they all relate to the notion of a latent variable, and the models that evolve from that are here to stay. If I had to point to one idea, one concept, that’s the one I would pick.
And what is it about latent variable theory that makes it so special?
It changes the way you think about models. The idea that observed correlations are caused by an underlying latent variable process is extremely appealing, both to psychologists and to psychometricians. It enriches the notion of “you don’t measure just what you observe, but you may also measure what you don’t observe.” This may be a simple idea, but it has proven to be very influential.
I wonder what you think of the latent variable: do you believe it’s an existing entity? When you model a latent variable, nowadays, what do you believe it is?
Do you mean the causal status of a latent variable? I think if you introduce a construct of this nature, you should take it as seriously as you can, so it shouldn’t be introduced because of convenience, or a lack of alternatives, but it should be introduced because ultimately you expect it to be something meaningful.
What do you think is the biggest challenge for psychometrics?
I think the biggest challenge for psychometrics is to make its case, to demonstrate again and again that better psychological measurements can be obtained and matter in prediction, in theory testing, in facilitating progress across the social sciences. I don’t think it’s enough to just say that it matters. We have to make the case so that people become more aware of the significance of better measurements and also more critical of poor measurement practices. So we have to keep working on improving and developing measurement models and on exploring new ways of measuring human behaviors. That’s the biggest challenge in my view. The significance of “better measurements” has become even more salient now because it is so easy these days to collect large data sets.
How should psychometricians go about this?
The obvious answer is that psychometricians have to continue looking for opportunities to demonstrate how better measurement can be obtained in different disciplines and how it can be achieved by reducing both systematic and non-systematic sources of measurement error. It’s up to the psychometricians to innovate and to show how to do that; the academic and also the societal benefits of this work are tremendous, and if we do it right, the importance and role of psychometrics can only increase across the social sciences.
Chapter 17
Paul De Boeck
“Modeling doesn’t necessarily have the purpose to measure, but if you have a good model, measurement follows automatically.”
Paul De Boeck is professor of quantitative psychology at Ohio State University. He was president of the Psychometric Society in 2007. De Boeck earned his Ph.D. at Catholic University Leuven in Belgium in 1977 under the supervision of personality psychologist Willem Claeys. His main research interests are individual differences and explanatory measurement. Recently, De Boeck has taken up parallel modeling of response and response time data.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_17
Paul De Boeck, thank you for being here today. Thanks for your participation in this oral history project, about the history of psychometrics. I’ll be interviewing you today about three topics: first, your own personal career; second, the interaction between psychometrics and other fields such as psychology or statistics; and third, the history and the future of psychometrics. The first question I wanted to ask was: how did you end up in psychometrics?
A long time ago, I was a student in psychology. At the time, there was a new program that was organized for mathematical psychology. That was the year after Luc Delbeke, a newly appointed faculty member, came back from a one-year visit with Clyde Coombs, so he started the program in mathematical psychology. I was one of the students of the first class, together with Norman Verhelst and Marcel Croon. But mathematical psychology at that time was and still is quite different from psychometrics. I got a master’s in psychology with a thesis on psychophysics. For my Ph.D. research, I was intrigued by personality. I did my Ph.D. with Willem Claeys, a personality psychologist, as my advisor. I was actually doing research to show the weaknesses of tests. My point was they were not something we needed. It was a critical dissertation, and I was not really interested in measurement. I was going to become a researcher in the field of personality with a kind of dislike for personality tests. After graduating I was so lucky to obtain a tenured research associate position (I don’t know whether this kind of position still exists), which was like a flat career; there was no way to become a professor. I had a lower salary and also no students. I was actually quite satisfied with that position. But then the professor of psychological assessment retired.
I did not apply because I was not interested in assessment, but the dean came to see me and told me that the people who applied didn’t really have a good CV and that I should apply since I had some publications and so on. This was my chance, he said. I applied after the deadline, and I was appointed as an assistant professor for psychological assessment. I didn’t like this too much; it was certainly not my expertise, so I looked around in the literature for something I wanted to do, being a professor of psychological assessment with a mathematical psychology interest. And that’s why I got interested in psychometrics, because it is statistical modeling for psychological assessment. That’s how I became a psychometrician: not because I liked measurement, but because it was the best compromise between my teaching duties in psychological assessment and my interest in mathematical psychology. From there on, I think the compromise worked quite well, and Leuven is still a mixed group. I think it’s now more a statistical modeling/mathematical psychology group than a psychometrics group. People sometimes consider me a psychometrician. I would respond by saying: “I am a non-measurement psychometrician.” I’m in the first place interested in understanding test responses and human behavior. Statistical modeling is a way to do psychometrics, and I’ve learned from two Dutch mathematical psychologists, Tom Bezembinder and Eddie Roskam, that measurement is a spin-off and not the primary purpose. You should not first measure; measurement comes automatically if you have a good model for the data. Modeling doesn’t necessarily have the purpose to measure, but if you have a good model, measurement follows automatically. That was their view, and that’s also my view. You can stop after the modeling and say: I’m not interested in the measurement of intelligence; I want to understand how people solve items in intelligence, which is something I learned from the early work of Robert Sternberg.1 I don’t need to know how smart people are, I don’t need to measure people’s IQ, but I want to know how people work on problems in intelligence tests, and then I’m done. If you are interested in measurement, that’s fine; I can give you the numbers.
What you’ve been telling me so far reminds me more of mathematical psychology than of psychometrics.
Yes, but there is a difference: there is nothing especially psychometrical about psychometric models; they’re just statistical models, but different from models in mathematical psychology. So, the type of model is different. I usually work with IRT models, which is of course considered psychometrics, but these IRT models are just regular statistical models. Biostatisticians use the same models, but they are not “measuring.” They are modeling the data; that’s actually what I like.
What I learned, and correct me if I’m wrong, is that mathematical psychologists make a model that is meaningful. Every parameter has a specific meaning, so, for example, there’s a parameter in a model that denotes a specific cognitive capacity. Do you prefer the more traditionally psychometric models?
One of the things I’ve been doing is to turn psychometric models into models where you actually do have a psychological meaning for parameters. One of the things we did was to come up with an interpretation of the 2PL model in terms of the diffusion model.2 And there are other examples, like “explanatory measurement,” as we call it.3 You bring in covariates to explain ability, the item difficulty, and even the item discrimination.
Explanatory models come with more meaning than just measurement. So, I do share the common interest of mathematical psychologists, which is to work with meaningful parameters that do not necessarily refer to measurement, but which can be interpreted in a psychological sense. I guess that’s the case because of my early training in mathematical psychology.
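As a hedged illustration of what “bringing in covariates” can look like, the equations below sketch a two-parameter logistic (2PL) model together with one common explanatory parameterization. The notation (person covariates Z, item covariates X, and the weights attached to them) is an assumption introduced here for illustration and is not taken from the interview; the decomposition of the discrimination parameter in particular is shown only as one conceivable option.

```latex
% 2PL model: probability that person p answers item i correctly
P(Y_{pi} = 1 \mid \theta_p) =
  \frac{\exp\{a_i(\theta_p - b_i)\}}{1 + \exp\{a_i(\theta_p - b_i)\}}

% Person side: ability explained by person covariates Z_{pj} plus a residual
\theta_p = \sum_{j} \vartheta_j Z_{pj} + \varepsilon_p,
  \qquad \varepsilon_p \sim N(0, \sigma^2_{\varepsilon})

% Item side: difficulty explained by item covariates X_{ik}
b_i = \sum_{k} \beta_k X_{ik}

% Optional (assumed) analogous decomposition of the discrimination
a_i = \sum_{k} \gamma_k X_{ik}
```

Under these assumptions the weights on the covariates, rather than the individual person and item parameters, carry the psychological interpretation.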
1 Sternberg, R. J. (1977a). Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities. Hillsdale, NJ: Erlbaum.
2 Tuerlinckx, F., & De Boeck, P. (2005). Two interpretations of the discrimination parameter. Psychometrika, 70, 629–650.
3 De Boeck, P., & Wilson, M. (Eds.) (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer.
That makes sense! Earlier, you mentioned Clyde Coombs.
I’ve never met Clyde Coombs, but the professor who started the program in mathematical psychology in Leuven got his training from Clyde Coombs. Data theory is something I still work with. I use the ideas of Clyde Coombs almost every day. I’ve never met with Clyde Coombs though.
He is maybe an intellectual inspiration to you.
Yes. I know that one of the questions is: “did you have an inspiration?” I do like his ideas very much, and I read his book A Theory of Data.4 Unfortunately, students these days don’t know anything about it. I think that book gives you a great frame of thinking. And psychometrics is quite different from data theory. It’s difficult to tell where ideas come from. One of the things I like very much, like I just explained, is measurement as a spin-off, as a by-product: you can have it if you want, but you don’t need it. That idea is from Eddie Roskam and Tom Bezembinder and actually goes back to measurement theory. You can only measure if you have an understanding of the phenomena; you cannot measure temperature if there are no laws that make use of the variable temperature. You first need to learn your physics and then you can measure. You cannot just start measuring if you don’t know what it’s about, and that’s what we do so often in psychology and psychological measurement. We don’t really know what we’re measuring, but we’re measuring anyway. There have been many years that I avoided the term “measurement,” but now I do use it. Here in Europe I was able to avoid the term altogether, but in the United States they won’t understand how IRT would not be focused on measurement. But see, I don’t think we measure when we measure ability; I don’t know what ability is, so I cannot claim that I’m measuring an ability.
Are you searching for what abilities are?
No, not really. It’s not a high ambition, but I try to understand the data.
Ability refers to an unknown source of individual differences, but that’s perhaps not the most interesting part. There are perhaps other aspects of the data that I can understand better. My primary interest is “understanding,” and that is why I am in the first place a psychologist. All these subdisciplines of psychology are trying to understand human behavior. They are not too much interested in measurement; that’s only the boring part! They are interested in understanding human behavior. I happen to be a psychometrician who is interested in human behavior, and I believe my substantive colleagues in developmental and social psychology don’t see how psychometrics can contribute to the understanding of human behavior; it can only contribute to the measurement part. Unfortunately, psychologists have to measure now and then, and that’s why they need psychometrics, but in my view psychometrics has so much more to offer with regard to understanding human behavior.
4 Coombs, C. H. (1964). A theory of data. New York: Wiley.
Can you explain one of your own studies in which you try to understand human behavior through psychometrics?
There is this famous example that some believe is overused (that’s what reviewers tell me at least). Long ago, we did a study on verbal aggression.5 We had a test on verbal aggression, and we measured the verbal aggressive inclination of people. My interest was why people are verbally aggressive. So what we did (this was not an original idea; it was published first by Susan Embretson6) is use the idea of a test design. A test can be considered a “within-person experiment.” Because a person gives repeated responses, all tests are repeated measure designs. So, why should we not make use of the design and manipulate factors in the design? I don’t see the difference between an experiment and a test. A test gives you data, and so does an experiment, and because in a test one uses repeated measures, it is possible to manipulate certain factors. Because a test consists of a set of items, you can manipulate the items to create a certain state in the person taking the test, such as stereotype threat or frustration, which can be induced to a different extent depending on the item. For example, you can manipulate whether or not the source of frustration is yourself or that it’s the fault of the interviewer. The interviewer can ask me stupid questions!
Let’s hope not!
I can be frustrated because I prepared for this interview and could’ve done much better, or because you ask me the wrong kind of questions, then you are to be blamed. It’s interesting to look at what the difference is depending on the source of frustration or depending on the type of verbal aggression. Let’s say there are expressive verbal aggressions, like shouting “AAH!,” which is not necessarily accusing someone but just an expression of frustration. If I would say “damn you!,” then I’m actually accusing you of something; so the question is: does that make a difference in a test?
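As an editorial aside, the “test as within-person experiment” idea described above can be written down compactly. What follows is a minimal sketch in the style of a linear logistic test model, and not necessarily the model used in the verbal aggression study; the two item factors and all symbols are illustrative assumptions.

```latex
% Illustrative sketch only; the symbols and the two design factors are assumptions.
% Y_{pi} = 1 if person p endorses the verbally aggressive response to item i.
\operatorname{logit} \Pr(Y_{pi} = 1 \mid \theta_p)
  = \theta_p + \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2},
\qquad \theta_p \sim \mathcal{N}(0, \sigma_\theta^2)
% X_{i1} = 1 if the source of frustration in item i is someone else, 0 if it is oneself
% X_{i2} = 1 if the response in item i is blaming ("damn you!"), 0 if merely expressive ("AAH!")
```

In this sketch, the person parameter is a random effect, and the coefficients on the manipulated item factors carry the explanation: asking whether the source of frustration or the type of aggression “makes a difference in a test” amounts to asking whether those coefficients differ from zero.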
5 Smits, D. J. M., De Boeck, P., & Vansteelandt, K. (2004). The inhibition of verbally aggressive behavior. European Journal of Personality, 18, 537–555.
6 Embretson, S. E. (1985). Test design: Developments in psychology and psychometrics. New York: Academic Press.
This whole story was actually meant as a prototype example to explain the methodology of explanatory measurement. Unfortunately, most tests don’t come with a design; it’s just a bunch of items. That’s so sad, because all these respondents spend so much time on doing all these tests, and these tests are an opportunity to do psychology. Now, tests are considered just as measurement tools. I don’t like the word “tool” either; test data is just regular data, just like experimental data. A more recent example is depression. A general research hypothesis is that personality disorders and affective disorders are perhaps more extreme forms of the normal range of emotions or the normal range of personality. To study affective disorders, I thought: why not use the dimensional affect core theory? This theory entails two dimensions: positive versus negative affect and arousal versus lack of arousal. If you look into the depression literature on the PROMIS item bank7 and other inventories, the theory is that there is a dimension for depression and another for anxiety. They’re pretty much related, in the affect core dimension space; they’re not two independent dimensions. I think depression is conceptualized as negative affect and lack of arousal, and anxiety is conceptualized as negative affect and a lot of arousal. This week, a student fit a so-called bipolar model on a depression inventory, with a negative versus positive (bipolar) dimension and an arousal versus lack of arousal dimension, with explanatory covariates for the item discriminations. That actually works better than the traditional models for depression and anxiety. She used a covariate for the loadings, so the loadings are made a function of the coordinates in the affective core space. That works well. The reason why I was interested in this topic was not to measure depression or anxiety, but to test if I can understand the test data from the perspective of affect core theory. And the answer is I think “yes.”8
In these interviews, I’m really interested to hear about the relationship between psychology and psychometrics. When I hear you speak about your research, about what psychometrics is, about the role of psychometrics, I’d say that you still consider yourself a psychologist.
I think so yes, but the problem is, I don’t believe in so many things in psychology. And that’s not because of the replication crisis. It’s perhaps because I’ve been doing quantitative psychology for so many years; it seems as if many things in psychology are inspired by a need to surprise. Authors want to surprise the reader. I’m very interested in psychology, but so many things that psychology is identified with, I’m not really interested in. However, each time when I do psychometrics (or almost every time) I can get attracted by technical problems as well as the topic itself.
7 Cella, D. et al. (2010). The Patient-Reported Outcomes Measurement Information System (PROMIS) developed and tested its first wave of adult self-reported health outcome item banks: 2005–2008. Journal of Clinical Epidemiology, 63, 1179–1194.
8 Xu, M., De Boeck, P., & Strunk, D. (2018). An affective space model for depression and anxiety. International Journal of Methods for Psychiatric Research, e1747.
There is still a need or interest in trying to understand how psychology works. And in that sense, I am a psychologist. I would object against considering psychometrics as a tool, which is how it’s mostly considered in psychology departments. I think psychometrics is a way of thinking about substantive issues, and it’s possible to come up with substantive ideas, based on a certain way of understanding psychometric models. Many of my colleagues don’t look at psychometric models the same way I look at them.
I can imagine. Have you received considerable criticism on your work?
Lots of papers that I submit have been criticized. I don’t know whether they like the explanatory way I think about psychometrics; the field is dominated by measurement. I often get the question: “is this useful?” I don’t think that’s always a good question; it’s not a necessary condition for research. When the reviewers criticize my work for not being clear enough, I can see that that’s indeed important. But when they criticize me for lack of usefulness, that’s only from their perspective. If you work for a testing company or if you have to make decisions about students, then I can fully understand that you go for an optimal kind of measurement and that understanding is less important. But I don’t think one should require that I am useful in that sense. Why is trying to understand human behavior not good enough?! The only criterion should be quality, and not that it is useful for a specific purpose. That’s the kind of criticism I don’t like, to which I would object, and unfortunately (for me) the field of psychometrics is almost exclusively about measurement. So not only the testing agencies are critical of the lack of usefulness but also the psychometricians. I can see the meaning of their work. The funding for people at universities is also dependent on what is considered useful and what test companies do, and that’s fine, but that shouldn’t be the dominant mode of a scientific discipline. Why not allow for a larger variety of psychometrics, like non-measurement psychometrics?
We’ll spread the message! Have you ever had considerable doubt about the direction you took?
I think I’m schizophrenic; the ultimate criterion is whether I’m having fun or not. If you would ask me is this the right way to go ahead, then I wouldn’t know; that’s an impossible question. In that sense I’m schizophrenic. I don’t think my work is important.
You don’t think your work is important?
I don’t think so; it’s just fun for me. When people would say that the approach I take is not the optimal approach, I would disagree. But it’s not that important either; I’m just doing what I’m doing, and if other people don’t like it, that’s perfectly fine.
I would defend my opinion, and I’m of course in favor of my opinion, but it’s not too important. Think ahead 100 years: the things we consider important now probably wouldn’t be talked about anymore by then. Meanwhile I do my best and try to have fun in my work.
So, if you’d think ahead, where do you see psychometrics in 50 years?
I have these discussions quite often with my students and colleagues. I believe that psychometrics is just a way of statistical modeling, combined with an interest in psychology, and if you want to measure, these statistical models give you the opportunity to do so. For me, a latent variable is a random effect in a statistical model, period. I don’t have metaphysical questions, and I don’t care what a latent variable is. It is a random effect that I need to model my data. Psychometrics is statistical modeling in the first place. Sometimes there are practical problems, like adaptive testing, which are important in certain circumstances, and I don’t think statisticians are going to solve the problems that come with this engineering. See, there is a part of psychometrics that is engineering. I can see the fun of engineering as well, but engineering does not necessarily contribute to a better understanding of human behavior. Engineering psychometricians contribute to solving practical problems in the first place, like adaptive testing, which is a nice intellectual challenge. I’m not going to do it though, because that’s not my practical problem to solve. Perhaps, if you’d identify psychometrics as trying to solve a number of test-related practical problems, there are aspects that do imply statistical modeling, but there also may be aspects that are perhaps not strictly statistical modeling, I wouldn’t know.
I spoke to you before this interview, and you feel that psychologists do not always take psychometrics seriously.
Yes, and that’s sometimes true for psychology journals. I don’t think the journals like IRT, and that is true for psychological assessment journals as well.9 It is the general belief of psychologists, and I hear similar things from different schools in the United States, that psychometrics in the sense of IRT is not the way to go for psychologists. There also is a divide within quantitative psychology, between people who go to IMPS and people who don’t go to IMPS, and there are many quantitative psychologists among the non-goers as well. They develop tools, factor models, multilevel models, and meta-analytic procedures, but they don’t come to IMPS. And the people who go to IMPS often don’t go to other meetings. It would be of interest to study the sociology of our discipline.
I have the feeling that you maybe hope that in the future the two worlds will collide, that they will come together?
I don’t think it will happen.
9 De Boeck, P. (2017). Psychological testing. In W. J. van der Linden (Ed.). Handbook of item response theory. Vol 3: Applications. New York: Chapman & Hall.
One of the reasons is that more than half of what IMPS does is “educational measurement.” The kind of work is as good as what other people do in quantitative psychology, but it’s not the things that psychologists are interested in. Psychologists are interested in explanation in the first place and not so much in measurement models. I can understand that, because I have the same kind of interest; I want to explain in the first place as well. Explanation is what most psychologists want, and that’s not what psychometrics generally gives you. It is possible, and it’s what I’m interested in, but it’s not the preferred way. So, I don’t think it will happen, but it’s not too much of my concern. For example, I have been editor of the applied section of Psychometrika, and I wanted the submissions to really contribute to a better understanding of psychological phenomena and phenomena in related disciplines. That concept hasn’t really worked, and it still doesn’t work, although the applied section does well in many other respects, so I think it’s almost an impossible thing to do. People are trained differently; they have different interests. Not only do intellectual interests differ, but the environment where you work determines what you consider important. And what I try to do is an impossible thing; it’s not easy to be a non-measurement psychometrician among measurement psychometricians, so it’s difficult in that way, but also in another way, because most psychologists are not interested in complicated models, especially when it is just to measure.
Do you feel you’re on an island?
Yes, but the reason I don’t give up is that I like my work. The ultimate criterion is that you’re happy yourself! Whether other people like your work is actually less important, because being unhappy with yourself is not a good idea.
If you look at traditional psychometrics, or measurement psychometrics as you call it, what do you think is its greatest contribution to the scientific world? What has psychometrics contributed to our knowledge now?
First of all, there is standardized testing. It’s a very ambivalent thing, though. I can see the use of what CITO and ETS and other test companies do. I can see the sense of it, but it is not so much a scientific contribution; that’s more a contribution to society and therefore as important. When people ask me what I contributed to society, I would refer to my work in testing. Scientifically speaking, the biggest achievement is factor analysis, for example, Thurstone. There are certainly other important contributions, but factor analysis is something other disciplines have been inspired by. Factor analysis has become more general. Unlike factor analysis, cluster analysis was developed in other disciplines, and that is also true for multilevel modeling. These are also very important, but I would say factor models are the most important contribution of psychometrics.
What would you say is your biggest contribution so far?
Perhaps what I’m known for is explanatory measurement. That is I think what I’m identified with. I cannot stay with something for a long time, because I get bored, though I still work some of the time with the same approach.
We had a contract to do another book on explanatory measurement, and there were so many new developments that we could easily fill a new book. When you have experience in a field, you have a better idea which topics are interesting. But I’m too curious; I like novelty too much! Nowadays, I’m into modeling parallel data, and I think there is a big future for this. I’m currently working on models for accuracy, for correct/incorrect and response times. That’s perhaps the most important research for me at the moment. It has such huge potential, and I think we’re missing chances there. We’re now using these models for tests and accuracy of responses and response times, but there is also brain imaging data, which is another type of parallel data. I think modeling of parallel data, of all kinds, has a huge future. And as psychometricians, we have the expertise to do so. I think psychometricians have all the expertise to model parallel data, for the sake of understanding and less so for the sake of measuring.
You don’t need parallel data to measure; there are other reasons why you’d need response times (although they may help a bit to measure). There’s a much broader opportunity to model parallel data, and I think we have the expertise to do so. I see opportunities, but the issue is to get funded, because then you would not be funded for brain imaging if you don’t have the publications already. It’s a difficult start, but we do have the expertise to do so. And then we can enter other disciplines.
Is that also part of the plan?
Yes, but I’m afraid I’m getting too old. If I’d have to start over again, that’s perhaps what I would focus on, but I’d hope to get at least some work done. I actually started doing parallel modeling with accuracy data, response time data, and brain imaging data, but the available datasets are perhaps not sufficiently reliable on the accuracy part of the data, so there may not be enough information in the data.
You mentioned that factor analysis was psychometrics’ biggest achievement scientifically speaking. Would you consider Spearman to be the most important psychometrician?
I like Thurstone better. He was doing factor analysis, but not just for measurement purposes. His paper was called “The vectors of mind”10; he wanted to explain the human mind. He had both a measurement interest and an interest in understanding how the mind functions. I think Spearman was inspiring; he also tried to understand the human mind, but I like Thurstone better. By the way, Spearman came up with the speed/accuracy trade-off in cognitive psychology, before anyone else did. People don’t cite him for it, but he definitely thought of it first.11 He’s an early diffusion modeler; he also used discrimination tasks. My colleague, Roger Ratcliff, is developing ideas about intelligence. First, he was only interested in easy binary choice problems, but then he also developed an interest in intelligence. And he says that it’s really interesting to read Spearman.
10 Thurstone, L. L. (1934). The vectors of mind. Psychological Review, 41, 1–32.
11 De Boeck, P., Gore, L., Gonzalez-Larrondo, T., & San Martin, E. (2020). An alternative view on the measurement of intelligence and its history. In R. Sternberg, Ed., Cambridge Handbook of intelligence. Cambridge, UK: Cambridge University Press.
I think, in an objective way, Spearman may have been more inspiring than Thurstone, but I think I like Thurstone better.
I think many psychologists and maybe also psychometricians suffer from some sort of physics envy; they really look up to physics, because they are good at measurement. I was wondering whether you would also think that there is a field that psychometrics can learn a lot from? Is there a field you’re jealous of?
I did my master’s thesis on psychophysics, so I kind of like physics. But psychology is never going to work like physics. Behavior is so much more complex. We also haven’t found a way to look at the right level of abstraction. I often use this example: when leaves fall from the tree, their path is impossibly complex. Their paths are impossible to predict. Some leaves go up because of the wind. And when you go to another tree, their paths are different yet again; leaves differ with different trees. Suppose your research agenda would be to understand the path of leaves falling from trees in the fall; well, that’s basically psychology. In physics they don’t care about the individual stories of the leaves. The only thing they’re interested in (I’m now simplifying, and perhaps physicists would contradict me) is that eventually all these leaves fall down, and that’s a universal law. We as psychologists try to understand the path of falling leaves. But often we don’t look at the right thing. I think physicists have a better way to isolate all the irrelevant parts of the phenomenon. That’s another problem we have. What physicists do wouldn’t work in psychology. If I’d have to start over, I’d probably look into biological psychology. But there are also so many uncertainties there; it doesn’t provide more certainty. It’s too late to change now. I don’t have the skill or a talent to be envious. Perhaps I’m too much of a realist; I know my age, and I cannot start over. I have no agenda for the future of psychometrics or psychology; I don’t care too much. My only concern is my own life, and I don’t need to have an effect or an influence. Even the presidents of the United States; who can tell us about the impact they had? I think it’s futile to think that you have an impact; you better go for what’s within your reach. Just enjoy the things you’re doing, that’s much less ambitious but so much easier to realize!
Chapter 18
Brian Junker
“I have the most fun when I don’t know what’s going on.”
Brian Junker, professor of statistics at Carnegie Mellon University, was president of the Psychometric Society in 2008. Junker finished his Ph.D. under supervision of Bill Stout at the University of Illinois in 1988. Throughout his career, he has worked on a wide variety of topics, including mixture and hierarchical models and (non)parametric inference for latent variable models.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_18
Thank you, Brian Junker, for being part of this oral history project on the history of psychometrics. In this interview I will ask you questions about three themes: your career as a psychometrician; the relation between psychometrics and psychology, or even the relation between psychometrics and other scientific disciplines; and finally your view on the history and future of psychometrics. We’ll get there slowly. So the first question I always ask is: how did you end up in psychometrics?
Well, it was an accident.
So many people say that!
I started a Ph.D. program in pure mathematics and was actually beginning to write a dissertation on stochastic processes. The particular topic that I chose meant that the further I got into the topic, the less people there were to talk to. Until eventually, there was basically just my advisor and a couple of people at a research institute in the Alsace region in France, and that was not enough people for me to talk to. So I actually considered dropping out of graduate school altogether, because it wasn’t satisfying. But before I did, I was actually working a summer job teaching high school teachers about statistics, and the man who would later become my advisor, Bill Stout, said: “Before you drop out of graduate school, why don’t you read this paper?” The paper was one of the earlier papers of Paul Holland,1 which preceded the work of Holland and Rosenbaum on conditional association.2 And when I read the paper (it was a Psychometrika paper) I thought, number one, I can understand this, whereas it was getting difficult to understand the stochastic processes stuff. Number two, lots of people are interested in this topic, so there seemed to be lots of people I could talk to. And number three, I can probably make a contribution here. And so at that point, I transferred from the mathematics department to the statistics department, stopped working on the mathematics Ph.D., and began working on a Ph.D. with Bill Stout in statistics.
1 Holland, P. W. (1981). When are item response models consistent with observed data? Psychometrika, 46, 79–92.
2 Holland, P. W. & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. The Annals of Statistics, 14, 1523–1543.
So before that, Bill Stout wasn’t your advisor?
No, there was another advisor. A very smart, very nice gentleman, named Frank Knight, but I was basically in over my head with Frank.
And that’s how you got into psychometrics?
Yes! I actually was only a graduate student in statistics for 2 years, and in those 2 years I learned a little bit of statistics and a lot of psychometrics from Bill. After my PhD I stayed at the University of Illinois as a visiting assistant professor for 2 years and taught statistics classes and worked with Bill on research. And then I went to Carnegie Mellon University, as a postdoc, and basically learned a lot of statistics by teaching it rather than by taking classes. I guess psychometrics kind of drew me to statistics, and I became a statistician, with a really strong interest in psychometrics.
So what about psychometrics sparked your interest?
I’ve always been interested in mathematical and probabilistic modeling of real-world phenomena, and that’s certainly a big part of psychometrics. The modeling aspect of it is very interesting to me, and it seemed clear when reading the first few papers in psychometrics that there were interesting problems that could be solved and that I could make a contribution to. I thought I’d try it for a while, and it has worked out relatively well.
Bill Stout was your advisor. Wasn’t he originally also a mathematician?
He was originally a mathematician. He worked in a somewhat different area of statistics, although it was related to the stochastic processes I was working on. Mathematics at that level had played out for him, and he was looking for something else to do and got involved with psychometrics, maybe 5 years or so before I came along. And so, by the time he was asking me to read this paper of Paul Holland’s, he was mainly focused on psychometric research and wasn’t doing much of pure mathematics anymore.
And he was also a president of the Psychometric Society!
He was a president of the society also, that’s right.
So what was your dissertation with him about?
It was on a few different but related topics that had to do with extensions of a paper that Bill had written about essential unidimensionality, which I think was published in 1987 in Psychometrika.3 That work had sort of established the utility of the essential unidimensionality model, and a part of the Ph.D. dissertation that I did extended that work beyond dichotomous models to models for polytomous responses in a very nonparametric way, so not with respect to particular parametric models, but certain polytomous models in general! And I also showed that you could still have things like consistent maximum likelihood estimators and things like that,4 even under this weaker assumption than local independence. Another part of the dissertation extended some of the conditional association work that Paul Holland and Paul Rosenbaum did, so I sort of took a couple of interesting things and found ways to push them further.
3 Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589–617.
4 Sijtsma, K., & Junker, B. W. (1996). A survey of theory and methods of invariant item ordering. British Journal of Mathematical and Statistical Psychology, 49(1), 79–105. Junker, B. W., & Sijtsma, K. (2000). Latent and manifest monotonicity in item response models. Applied Psychological Measurement, 24(1), 65–81.
And did you work on those topics during the rest of your career?
Yes, though not so much the essential independence stuff. Bill actually had a big research group, and there were many people working on essential independence tests for essential unidimensionality and things like that, and I was a little bit involved in that but not deeply involved. The first part of my publishing career involved a paper on maximum likelihood estimators and related ideas for essential unidimensional models.5 But there was also some work with Jules Ellis at Nijmegen University.6
Nijmegen?
Yes, exactly. That took the conditional association work of Holland and Rosenbaum and combined it with some other work from things related to probability equalities in mathematical probability and established a characterization, a completely nonparametric characterization, of fully unidimensional IRT models and other kinds of single factor models, and that was really, really fun work. It was exciting because we actually arrived at something that no one had done before, and it was exciting because of the generality of the work. It was sort of fortunate that both Jules and I were trained in mathematics because we could use those tools, and it worked out very well, and it was an example of the kind of publishing work that I really like to do, which is not so much … well, I don’t know how to describe it exactly. I have the most fun when I don’t know what’s going on. I’ve written a lot of papers, and this work with Jules is an example of that.
5 Junker, B. W. (1991). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika, 56, 255–278.
6 Ellis, J. L., & Junker, B. W. (1997). Tail-measurability in monotone latent variable models. Psychometrika, 62, 495–523. Junker, B. W., & Ellis, J. L. (1997). A characterization of monotone unidimensional latent variable models. The Annals of Statistics, 25, 1327–1343.
I write the papers to figure out what’s happening, to figure out what’s going on, not so much because I want to pile onto something that’s already been done or move things a little bit, but just because there’s a question and I don’t really know what the answer is. So a lot of the times I just start writing in order to figure something out, and this was a good example of that, so that was quite fun.
Have you experienced that “ignorance” in later research as well?
All the time! Ignorance is my best feature. The work with Ellis is not really highly cited; it’s an example of what I think is good foundational work, but foundational work often doesn’t get a lot of citations. There’s other work that I’ve done that is more highly cited. There’s a paper with Richard Patz on applying MCMC on IRT models which has gotten a lot of citations7 and also a paper with Klaas Sijtsma on cognitive diagnosis models which has gotten a lot of citations.8 And in their own ways, both of those papers were motivated in the same way. I didn’t know very much about MCMC, and so Richard, who was my first Ph.D. student, and I sat down, and we decided to figure out how it would work with IRT models, and that became that paper. The cognitive diagnosis paper basically arose because Klaas and I were guest editors for an issue of Applied Psychological Measurement, and we had a set of authors that we wanted to have papers from in that special issue. One of the authors found that they couldn’t provide a paper, so Klaas and I had to come up with a paper quickly to fill the issue. I had been curious about cognitive diagnosis models and had been working in nonparametric IRT, and Klaas and I had been talking about invariant item ordering and monotonicity conditions and things like that. So we basically wrote that paper, partly as an emergency, but also because it gave us a chance to explore the relationships between these rather constrained latent class models, which are cognitive diagnosis models, and the kinds of monotonicity and invariant item ordering conditions that Klaas was very familiar with. So it was another way of writing a paper to explore what’s going on.
That’s cool, right?
It’s the best!
Can you identify the three most important lines of research in your career?
I already have a hard time mentioning one! I don’t know, I guess in a way I’m kind of a dilettante. I kind of go from area to area, and if I see there’s an interesting question, then that’s what I do. I think the work with Jules is of great foundational importance.
It turned out not to have the application that I had hoped it would have; that’s partly because I think Jules and I have moved on to other questions. There have been a couple of authors who’ve been trying to convert those kinds of foundational conditions into practical statistical tests. Bertrand Clarke is an example of an author who has done that. It’s been somewhat successful, but in terms of practical application not so much.9 In terms of extending our knowledge and establishing a fact that we didn’t know about before, I think the work with Jules is really important.
7 Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.
8 Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
9 Yuan, A., & Clarke, B. (2001). Manifest characterization and testing for certain latent properties. The Annals of Statistics, 29, 876–898.
Do you mind if your work is not applied as much? I think some people really care for whether their work has a practical application, and others think, “no, I figured things out for the sake of the bigger scientific picture.” You get different pleasures from the kind of work that Jules and I did, than you do from the kind of work that Rich Patz and I did. Those are kind of really extremes. I mean the work with Jules, it’s really satisfying to discover something you didn’t know before and no one knew before. That has an intrinsic satisfaction, and so it would be nice if the cover of Time Magazine summarized that paper, but it’s not likely to happen, and there’s plenty of intrinsic satisfaction. I don’t think many psychometricians are that lucky. Yes, exactly. The work with Patz on MCMC did not have that depth; it was satisfying and fun to do in a different way, but it’s been very satisfying to see that many people thought that that was a useful way to conceptualize the application of this really important computational method to models that matter in psychometrics. So that’s been satisfying because of the many citations, whereas the Ellis work was satisfying just because it was cool. Did you always want to become a researcher? Or was that also sort of an accident? That’s a good question. I’ve been enamored with quantitative work since high school, so I’ve always been interested in things like computing and mathematics. When I started as an undergraduate, even though I had those interests, I thought perhaps I’d be a theatre major. A theatre major! Exactly. I had been involved in theatre in high school, and I thought well, this is fun, so I took some theatre classes when I first started as an undergraduate. But by the end of my undergraduate time, I actually got a letter from the advising office at the University, and the letter said you couldn’t graduate if you don’t declare a major. I had never actually declared a major. 
When I looked back on the courses I had taken, there were not very many theatre courses, but there were lots of math courses, so I decided I must be a math major, and that’s how I got an undergraduate degree in mathematics. After that, I really didn’t want a regular job, and I applied to graduate school, because that was fun, and I applied in mathematics because that was fun. So you just kind of follow the rope, and where the rope leads you is where you wind up. And have you ever received like considerable criticism on your work? Or was it an easy-going experience? I haven’t, I think, not really. Most of the work that I’ve done has either been fairly mathematical; in that case you either proved it or you haven’t.
You’re wrong or you’re right. There isn’t really much opportunity to worry about whether you’re right in some social sense. I’m a little hard-pressed to think of particular examples. I’ve done some work that hasn’t been received as sort of practically useful, but I can’t think of any deep opposition to any of the work that I’ve done. I think I’ve been lucky or cloistered, one of the two. Considering you don’t have a background in psychology, did you develop an interest in psychology when you started working in psychometrics? I don’t have any academic training in psychology, not beyond the typical freshman course with 500 students in it, that’s really it. I have taught myself some aspects of psychology over the years, because it’s been necessary to understand the utility of the models that I’ve been working with and the statistical methods that I’ve been doing. One particular case of that is the work of John Anderson,10 at Carnegie Mellon University. He’s famous for a computational model of cognition, called ACT. I think in its current version it’s called Act-STAR, but there’s been an earlier version called Act-R and some others. And when I first heard about this computational model, I thought how much it reminded me of the kind of two-way data that is very common in psychometrics. Even though the structure of the model is very different, it really involves looking at variation across individuals as they succeed or fail at various kinds of cognitive tasks. So I tried to read and teach myself something about this kind of modeling in cognitive psychology, and actually at one time, I had a Ph.D. student who ended up not doing a Ph.D. with me but with someone else. 
While she was working with me, she and I developed a variation of this computational model of John Anderson’s, in which you could actually do statistical estimation of person parameters and task parameters and you could ask the kinds of questions that a psychometrician would be interested in, but then applied to this model which really came out of this very different part of psychological world. And that paper actually ended up being published; it’s a very nice paper. I’m trying to think of the journal; it’s one of the cognitive science journals where it got published.11 Anyway, when I’ve needed to know something about psychology, I’ve tried to learn it. What do you think is the relationship between psychology and psychometrics? I guess what I have to say, and maybe I’ll say it with a bit of a story, since we’re here in Asheville. Asheville is kind of a hotbed of a traditional type of folk music in the USA, which is known for “Appalachian old-time music,” a kind of dance music. Usually, the instrumentation is a fiddle, a banjo, a guitar, and maybe a mandolin, and
10 Anderson, J. R., Matessa, M., & Lebiere, C. (1997). ACT-R: A theory of higher level cognition and its relation to visual attention. Human-Computer Interaction, 12, 439–462.
11 Weaver, R. (2008). Parameters, predictions, and evidence in computational modeling: A statistical view informed by ACT-R. Cognitive Science, 32, 1349–1375.
it’s just a very nice danceable music. And I like to play that kind of music. So I found a couple of bars around town where you can join in and play with other people, and we talk to each other between tunes. Of course someone asks me what I’m doing in Asheville, and I say: “Well I’m at this conference.” “What’s the conference about?” “It’s about psychometrics.” “What is psychometrics?” And the answer which seems fairly satisfying is, “it’s statistics applied to psychology,” and of course that’s a little bit broad for what psychometrics actually is, but psychometrics sits in that realm. I think the main way in which psychometrics is a little narrower than broadly statistics applied to psychology is that it’s in some way involved in measurement or accounting for individual differences in a way that other applications of statistics in psychology are not. And so I think psychometrics is largely a set of quantitative and particularly statistical methods that are useful for modeling and learning about individual differences and the performance of cognitive and noncognitive tasks. So do you think psychometrics is always tied to psychology, in one way or another? That’s a good question. In a certain sense, by definition psychometrics is tied to psychology, but the methods are really just the methods of latent variable modeling for individual differences, and those may or may not be tied to psychology. I just had a couple of graduate students present here at the conference, on social network analysis. It doesn’t seem like it should be very much related to psychometrics or anything else, but in fact the models that my graduate students are working on - and these are models that existed in the literature before my students started working on them - are latent variable models, and they’re models for individual differences about tie-formation, about the formation of relationships among actors in a social network. 
And just as you would have latent variables for individuals in a psychological study or a psychometric study, you have latent variables for individuals in a social network, and the latent variables help explain why there are or are not ties among individuals; it’s the same set of mathematical and statistical tools. So we really benefit a lot from the application of statistics in more traditional psychological problems but also by extending those ideas and hopefully eventually returning those ideas back to psychometrics.
I read somewhere that you like to take ideas from different fields, and …
… put them together, that’s right. That has very much to do with my conception of how statistics actually works. Statistics is a kind of a crossroads of the sciences, and so there is psychology, there is physics, there’s geology, there’s all sorts of fields, which in one way or another use ideas from statistics. Statistics is strongest when it helps one field to figure something out and then realizes that the techniques, the quantitative techniques, in that field are useful in another field and can help there too. And so as a statistician, I’ve always been interested in the idea of transferring techniques from one field to another, and often I’ve transferred techniques into psychometrics, but in the case of social network analysis, I’m transferring techniques out of psychometrics, and it’s actually great fun.
Some people would say that psychometrics has become quite narrow, mostly dealing with IRT-related problems. Do you agree? This sounds a little bit tautological, but it depends on how narrow you define psychometrics. If you define psychometrics as what goes on in the Psychometric Society, that’s a bit narrow; it really is. But if you think of psychometrics a little more broadly—for example, there’s a relatively new international society called the International Educational Data Mining Society, and if you look at the work they’re doing, they are in some cases reinventing what members of the Psychometric Society already know. They’re reinventing those methods and models in contexts that are very different from the contexts that we usually think about, with data that’s very different from the kinds of data that we think about. In many cases they’re also extending ideas that either they learned from conventional psychometrics or they reinvented, to handle situations where conventional psychometric models don’t work. So I think there’s actually a lot of interesting psychometric work that’s not called that, for example, in the Educational Data Mining Society. I mentioned earlier the work of John Anderson. There’s a beautiful cognitive psychological model, which has a lot of features of psychometrics in it but isn’t recognized as psychometrics, although I think John knows that, in some way, he’s doing psychometrics. You know, there’s this famous book of Thurstone’s, called Vectors of Mind,12 and one of John’s books about the ACT-R model is called Rules of the Mind.13 Yes, he knows! He knows, he absolutely knows, that’s right. So, I think there’s a narrowness to the Psychometric Society. I see at the edges some broadening, and that’s great to see. I think that’s very important for this Society. Do you think the Psychometric Society should become broader than it is now? Yes, I do. 
But I also recognize that that’s a slow and difficult process, and it’s a process that is made more difficult by the very understandable desire that, for example, what gets published in the journal is of a highly mathematically rigorous nature. And the work that’s done in EDM, the Educational Data Mining Society, is typically not as mathematically rigorous as what you would find in Psychometrika. You find more applied work there, as you find with computer scientists, people who do machine learning and data mining work. Oftentimes the way the work proceeds is that one needs a way to deal with a large amount of data, and so one invents an algorithm that scales to a large amount of data and does essentially empirical studies to show that the algorithm is successful, but there is not much in the way of theoretical work. There certainly is theoretical work in machine learning, but that typically doesn’t happen at the level of the Educational Data Mining Society.
12 Thurstone, L. L. (1935). The vectors of mind. Chicago: University of Chicago Press.
13 Anderson, J. R. (2014). Rules of the mind. New York: Psychology Press.
Do you think that dealing with those larger data sets is part of the future of psychometrics? I think it is the future of quantitative analysis in general. I think we have all of the tools that we need to be collecting huge amounts of data all the time. We need to figure out what to do with that data, when there’s useful signal in that data, and when that data is either mostly noise or perhaps there’s signal but the signal is biased because of selection effects in collecting the data. We need to figure out how to build methods of inference that scale to large data but that are also consistent with what we know has to be true for rigorously established models for smaller data. I think all of those things are really important. And they’re just as important in psychometrics as they are in statistics, in machine learning, and in other areas. Would you ideally see psychometrics becoming a more general field, relating to different types of research? I think that would be a good thing. It’s partly this crossroads idea again. The Society has a great deal of expertise in this kind of two-way and more generally multi-way modeling that involves latent variables and some fixed effects for variation across those different dimensions. When you find other areas where those models are useful, not only do you end up helping those areas, but you find some problems you need to solve that create new methodology to solve those problems, which we can then use in psychometrics. I think that’s really important, and I’d like to see the Society become more of a crossroad. So, when you look at your career, so far at least, what do you think is your most influential work? The most influential work … What will you be remembered for? Oh my goodness. I know, it’s a big question. That’s a much harder question! Okay then, we’ll stick to the first. I’m going to guess that in 30 years, I won’t be remembered at all, which is fine. 
I’ve had a great deal of fun in my career, I’ve made some contributions I think are useful and important, I’ve moved the field in various ways, and if no one remembers me in 30 years, that’s just fine.
Yet … I think, right now, based on citations, the most influential paper is the paper with Klaas on cognitive diagnosis models, the NIDA and DINA models.14 Are you still working on those models? Are there any plans? Not so much. Again, I tend to skip around, and when I’ve answered a question that I was interested in, I look for something else that I’m interested in. As I said, right now my energies are kind of focused on social network analysis, because I’ve been curious about that, but there are still interesting questions in cognitive diagnosis and diagnostic classification models, and if one of them catches my interest, I’ll be back there, but I just kind of skip around and look for stuff that’s fun. I like that approach. When you look at the history of psychometrics in general, what do you think is the most influential book or article ever written? For me personally, probably the most influential book that I’ve read is Lord and Novick,15 and that’s partly because it’s encyclopedic. It has everything from factor analysis to IRT and other things that are relevant to standard measurement questions in psychometrics. Another reason is that, especially in the latter part of the book where the IRT stuff is discussed, there’s a real effort to connect psychometrics to current thinking in statistics. And when I look at earlier work in psychometrics, there are some efforts, and some people did try to do rigorous statistical work in psychometrics, but by and large, this was the first book that I was aware of where there was really a principled effort to connect psychometrics with statistics, and that made a great deal of sense to me. In terms of things that have been influential in the field, not so much books, but the field probably wouldn’t exist without Spearman and Thurstone. 
And it’s extremely important for them to have recognized this idea of developing factors and then multiple factors to explain in some mathematical sense human behavior; this was an extremely important idea. More recently, I’d say the paper on EM by Dempster, Laird, and Rubin,16 and the work on MCMC, especially the Gelfand and Smith paper17 which brought Markov Chain Monte Carlo into the awareness of statisticians, even though it had been around for a couple of decades. There’s a very nice readable survey of MCMC methods, very readable and
14 Junker, B. W., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Applied Psychological Measurement, 25, 258–272.
15 Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental testing. Reading, MA: Addison-Wesley.
16 Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39, 1–22.
17 Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409.
very applicable by Chib and Greenberg,18 again, in a statistics journal, not a psychometrics journal. But these methods, EM and MCMC, meant that you can write down a model that’s scientifically appropriate for the psychological phenomenon that you’re trying to measure and you don’t have to worry very much about whether you can estimate the model. EM, with a little bit of effort, and MCMC, with not very much effort but a lot of patience for the computer program to run, can estimate anything. So it has really given those of us who like building models for all these situations a great deal of freedom in building those models and knowing that we at least have a shot at estimating, without spending years trying to figure that out. So I think EM and MCMC have been really important in kind of expanding the scope of psychometric modeling.
What would you consider the biggest achievement of psychometrics?
You’d be hard-pressed to find a larger impact on society for psychometrics than large-scale standardized educational testing. It has been by and large a positive impact, I think. It’s another case in which ideas and principles from psychometrics went out into an application area and helped that application area to solve legitimate problems that existed in standardized educational measurement. And at the same time, there were new problems in that area, for which we had to develop new methods, and those methods came back to psychometrics. So I think it’s been really fruitful for psychometrics and really important at a societal level.
You say “by and large a positive contribution.” Do you think it may have had some negative effects as well?
Well, there are always the traditional validity-reliability debates. I’m actually a co-chair of the design and analysis committee for the National Assessment of Educational Progress, and I think about practical applications of psychometrics in these areas a lot.
Whenever I’m consulting with someone or I’m thinking about educational measurement, the validity-reliability trade-off is always there, and it’s always extremely important to think about. It has occasionally led to—some would say frequently, but at least occasionally—assessments which are basically too narrow for what they’re trying to assess. And on the other hand, on the other extreme, you can find assessments that have a great deal of face validity and substantive validity but appear to have so much measurement uncertainty or lack of focus on what’s trying to be assessed. You have to figure out in every case where the trade-off is most beneficial and when that trade-off isn’t well made; those are cases in which psychometrics isn’t helping so much.
18 Chib, S., & Greenberg, E. (1996). Markov Chain Monte Carlo simulation methods in econometrics. Econometric Theory, 12, 409–431. Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327–335.
You already mentioned Lord and Novick’s book, and you also mentioned Spearman as a very important ancestor. According to you, who is the biggest psychometrician who ever lived? Whose work inspired you?
I have no idea who the biggest psychometrician was, but it’s definitely the case that I was deeply and strongly influenced by both Bill Stout and Paul Holland. Those two gentlemen are extremely good and deep thinkers and really good at communicating both the intellectual content and the excitement of the field. And I was captured.
What about their work inspired you? What did they teach you?
Bill taught me first of all that there really is a place for rigorous thinking in applying mathematical statistics and related methods to problems that don’t at first look like they would be conducive to that approach. The idea to think rigorously about certain problems is, I think, huge. Paul has great and wide intellectual curiosity, and I think being around someone with that kind of breadth of intellectual curiosity is infectious. Everything is interesting to Paul, and that’s great, and both of them are really enthusiastic about what they do, and they’re enthusiastic about getting other people involved in what they do. And that kind of enthusiasm is so important and so infectious.
And you have that too, right? Haha! You have a strong interest in different topics, and you don’t stick to your darlings.
Well, from the point of view of psychometrics that’s true; from the point of view of statistics it’s much less true. Within psychometric and related things, I move around a lot, basically as a statistician with deep interests in psychometrics, but within the field of statistics, I haven’t really moved very far from measurement questions, psychometrics, and applications of statistics to psychology.
But you can find other statisticians who have done everything from factor analysis to astrostatistics, so it depends on how big the scope is, but certainly I move around within the scope of stuff that has interested me over the years. It keeps things interesting! So what do you think is psychometrics’ biggest challenge for the future? I think we talked about it a little bit before. I think the challenge for the vitality of the field is to become a little bit more of a crossroads, and to the extent it can do that, I think psychometrics has a great future.
It’s not dying out. No, and it won’t die out if it’s successful in making connections with cognate fields; that’s going to be the key. And what are your own plans for the future? Are there still problems you want to solve? I’m going to continue to work; I’m going to continue to look for interesting questions. Right now, there are questions in social networks that interest me. Another area that interests me, because of the work that I’ve been doing with Jodi Casabianca, is returning to a set of models called hierarchical rater models.19 These are basically hierarchical Bayesian or multilevel models for three-way data, and the three ways are students, tasks, and raters. If you look with a fairly mathematical eye, and one of my graduate students, Lou Mariano, did this for his dissertation, you’ll find that many approaches to combining information from multiple ratings tend to combine the information in a way that isn’t plausible given that the ratings have some dependence structure.20 This hierarchical rater model that I developed again with Rich Patz and a couple of other graduate students, Matt Johnson and Lou Mariano, accumulates information from multiple raters in a way that makes good statistical sense. And Jodi Casabianca who’s been at UT Austin, in the School of Education there, has been applying and extending those models in new situations, situations in which you really have very loose sparse designs for the assignment of raters to tasks and also designs in which you’re rating at multiple moments over time. So, she’s been developing longitudinal versions of the hierarchical rater model, and that work has been very interesting too. So in the near term, I’ll probably be continuing with the social network stuff and continuing with the hierarchical rater model work with Jodi. I think those are the kind of near-term goals. I don’t know what tomorrow brings, whatever looks interesting. Thank you for this interview!
19 Casabianca, J. M., Junker, B. W., Nieto, R., & Bond, M. A. (2017). A hierarchical rater model for longitudinal data. Multivariate Behavioral Research, 52, 576–592.
20 Mariano, L. T., & Junker, B. W. Covariates of the rating process in hierarchical models for multiple ratings of test items. Journal of Educational and Behavioral Statistics, 32, 287–314.
Chapter 19
Jos ten Berge
“It’s a very beautiful situation; a discipline that distrusts their own results.”
Jos ten Berge is emeritus professor of psychometrics at the University of Groningen. He was president of the Psychometric Society in 2009. Ten Berge earned his Ph.D. at the University of Groningen in 1977 under the supervision of John van de Geer, Willem Hofstee, and Ivo Molenaar. His main research interests are factor analysis, reliability measurement in classical test theory, and rank and simplicity of three-way arrays.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_19
19 Jos ten Berge
Well, thank you, Jos, for participating in this project. This project is about the history of psychometrics, and I’ll ask a few questions concerning three different themes. The first one is your career as a psychometrician; second, the relation between psychology and psychometrics; and finally also the history and future of psychometrics and your views on these. As a start-up question: how did you end up in psychometrics? I was studying psychology and took a course on classical test theory, where we were taught to derive reliability coefficients from certain independence assumptions, and I was fascinated by that. Then I majored in psychometrics, at the University of Groningen. I stayed here all my life, except that before I went to college, I went to the United States for 1 year. I didn’t know what to do after high school, and I had the opportunity to study at a college in the United States where I went hoping to find my future career. And when I got back, I chose psychology. I had taken some courses in psychology in the United States, and I thought that might just be the best thing for me. What exactly was it that attracted your attention to psychometrics, as opposed to maybe social psychology or clinical psychology? I liked the fact that things are true or false in psychometrics, whereas in social psychology, they are sometimes true, and many times they aren’t. I liked the mathematical part of psychometrics. Was there a specific person who inspired you? The teacher of that course in test theory was Willem Hofstee, and he sparked my interest in psychometrics. Yes. Did you know at that time that you wanted to become a researcher, or did you have other plans? I think all along I knew I wanted to be a researcher. I didn’t fancy myself testing children or doing psychotherapy or any of that. Then you did a Ph.D. here, in Groningen. Who were your advisors? John van de Geer of Leiden University and Willem Hofstee and Ivo Molenaar from Groningen University. 
What did they teach you?
One of the most important things I learned in my Ph.D. study was how to write a scientific paper, how to argue. They didn’t interfere much with my research. I chose the topic, and I wrote a thesis on congruence and Procrustes rotations in factor analysis, and then I asked Van de Geer if he wanted to be the main thesis advisor, and he agreed. I asked Wim Hofstee and Ivo Molenaar to be the local thesis advisors, and they agreed, and well, then I defended the thesis, and it was done.
As if it’s nothing! Well, it took no less than 7 years. It was fast back in those days. Sounds like a very long time. Did you change much in your research after? Rotations in factor analysis was one of the topics. I also picked up other topics like reliability measurement in classical test theory and three-way analysis. What would you consider the field that you spent most time on? What did you become known for? Within the realm of psychology, the work on rotations in factor analysis and reliability were of primary concern. Outside psychology, no one really cares about those. But the three-way analysis research is very popular outside psychology, in chemometrics and electrical engineering, for example. Can you explain the basics of three-way analysis? Suppose you ask a number of people a number of questions, and you repeat that measurement over time, always involving the same individuals and the same questions. That creates a block of data: a three-way data block. Three-way component analysis takes advantage of the fact that you have the same individuals and the same items over different measurement times. This gives information you cannot easily obtain from repeating a number of independent component analyses. Why was it that psychologists were less interested in this? Because psychologists seldom have data that fit the three-way model. Chemists do! How involved was your work with psychology itself? I was working on methods that were designed to serve psychology and related fields. Factor analysis and rotations are mainly used in psychology. So, there is a connection there. Reliability measurement is mainly used in psychology and hardly anywhere else. Would you say you’re also still a psychologist? No, I don’t see myself as a psychologist. I did not examine psychological questions other than that I wanted to see how the methods panned out in practice. 
I worked on factor analysis on intelligence data, for instance, to see how factor analysis worked, how it performed, but not because I’m so interested in intelligence data per se. That’s for other people to worry about. Was there another inspiring person whom you admired at the time? Maybe who was not directly involved with your own Ph.D.? Probably, but I cannot recall who they were.
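The three-way data block described in the answer on three-way analysis above can be made concrete with a small sketch. The scores, sizes, and function names below are hypothetical and serve only to illustrate the layout (same persons, same items, repeated occasions); this is not three-way component analysis itself, only the data structure it operates on.

```python
# Hypothetical toy data: 3 persons x 2 items x 4 measurement occasions.
# block[p][i][t] is the score of person p on item i at occasion t.
block = [
    [[1, 2, 3, 4], [2, 2, 3, 3]],   # person 0
    [[4, 3, 2, 1], [3, 3, 2, 2]],   # person 1
    [[2, 2, 2, 2], [1, 2, 1, 2]],   # person 2
]


def shape(block):
    """Return (persons, items, occasions) for a rectangular nested list."""
    return (len(block), len(block[0]), len(block[0][0]))


def occasion_slice(block, t):
    """Two-way persons x items table at occasion t.

    This is all that a separate, independent analysis of occasion t would see.
    """
    return [[item_scores[t] for item_scores in person] for person in block]


def trajectory(block, p, i):
    """Scores of person p on item i across all occasions.

    This linkage exists only because persons and items are the same
    at every occasion, which is what makes the block three-way.
    """
    return list(block[p][i])


print(shape(block))              # (3, 2, 4)
print(occasion_slice(block, 0))  # [[1, 2], [4, 3], [2, 1]]
print(trajectory(block, 1, 0))   # [4, 3, 2, 1]
```

Analysing each occasion slice on its own discards the trajectories; keeping the block intact is what lets a three-way method use them.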
Are there people now who were very inspiring for your work over the last decades?
It’s not so much about the people, but there are a number of papers that I really loved. The paper that I enjoyed most—ignoring papers from colleagues like Henk Kiers who were in the same department—is the paper by Woodhouse and Jackson1 and also Jackson and Agunwamba’s paper on the greatest lower bound in reliability theory.2 I had been working on that topic with a colleague, Frits Zegers, and we invented a whole series of new lower bounds, which were better than coefficient alpha.3 But then we saw this paper, telling us how to find the greatest lower bound, which was far more useful, and I really enjoyed that. The greatest lower bound made the lower bounds we developed more or less useless.
How is the lower bound different from Cronbach’s alpha?
It uses all information in the data. Cronbach’s alpha and other lower bounds use only part of it.
Did that paper have a lot of influence on the field of reliability afterwards?
No, surprisingly little. A couple of years ago, Klaas Sijtsma drew attention to the fact that the greatest lower bound has not received the attention it deserves, but actually it’s surprising how little attention it has received. It is probably because you need special software to calculate it. It’s not as easily computed as Cronbach’s alpha, for example. You need an iterative procedure, which is not trivial to program.
Is that also an explanation for why psychologists who use those tests are not eager to use it?
They might just be unaware that it exists.
When you look back at your career, have you ever received strong criticisms from other parties that told you that you were completely wrong?
No, but that would be really hard in the mathematical psychometrics I did. You derive or prove a theorem, and when the proof is correct, it will be accepted for
1 Woodhouse, B., & Jackson, P. H. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: II: A search procedure to locate the greatest lower bound. Psychometrika, 42, 579–591.
2 Jackson, P. H., & Agunwamba, C. C. (1977). Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: I: Algebraic lower bounds. Psychometrika, 42, 567–578.
3 Ten Berge, J. M. F., & Zegers, F. E. (1978). A series of lower bounds to the reliability of a test. Psychometrika, 43, 575–579.
publication, and you’ll never hear of it again. There’s not much room for conflict there. Were you critical of other people’s work? Sometimes, when people published a paper that contained invalid proofs. Did you ever experience having doubts about being a researcher? Do you mean whether I ever asked myself how important my research is? Occasionally, I have misgivings when I watch TV at night and I see what’s happening in the world; I realize that whatever I do is very remote from what’s happening in the world. It made me think that we shouldn’t overrate scientific research. There are other things going on. You mentioned you occasionally collaborated with psychologists, not necessarily because of what they were studying but because you were interested in how the model worked. What do you think should be the relation between psychology and psychometrics? It’s really very simple. The psychology departments hire psychometricians, because the psychologists need them to carry out data analyses. I expect that to always remain that way. Do you think that psychologists tend to overlook psychometrics? It’s not that they overlook psychometrics, but they shy away from it. Psychometrics is difficult, which makes people feel uncomfortable. But many psychologists have done excellent work without knowing the first thing about psychometrics. Do you think knowledge of psychometric methods is necessary to be a good psychologist? It’s helpful, and at some point, it’s always necessary. When you want to prove that something is really going on out there, you need psychometrics, or more generally, you need statistics. Psychometrics is really a special case of statistics. What is then still the distinction between the two? Statistics is far more general and deals with lots of methods psychometricians don’t really need. Would you say that psychometrics is closer to statistics than to psychology? It’s right in between the two.
You already mentioned what psychometrics can contribute to psychology: developing knowledge that is in fact true. Is that the main purpose of psychometrics in psychology? The first purpose of psychometrics is to help explore psychological data, finding what is in there and then putting it to a test, sooner or later. The test could be simply that you repeatedly find the same results in different data sets, for instance, the same components in different component analyses. Do you consider components or factors as psychological entities or as statistical entities? They are inventions that help us to summarize what we’re talking about. Is that what the psychologist is interested in? I think the psychologist needs those summarizations of information, because they’d be swamped with data if there’s no structure that can be brought in. Sure, but they also want to draw conclusions about psychological attributes, which they consider to be those same factors or components. That doesn’t bother me, but you asked me a philosophical question. If there is a world where components exist, it is in our minds and nowhere else. And they can be very useful. Are there practices in psychological research that annoy you? Psychological practices that you disapprove of? I’m sure I encountered practices that I didn’t go along with, but I’m not aware of one common mistake that psychologists make, so I cannot answer that question in any useful way. Do you read psychological research? I read about psychology research in newspapers, and I’m always surprised by the lack of knowledge of the journalists. They don’t know enough about psychology, the laws of research, of experimental research. They’re always drawing inferences from correlations. For example, this week I heard that husbands who do the dishes have more sex. I’m sure that this will at some point be translated to a causal statement that when you do the dishes more often, you will get more sex, or the other way around.
So this is not the psychologist’s fault, but the journalist’s. In the journal articles, you may find cautionary notes that this is a correlational phenomenon and not an experimental proof. But once the news gets out in the press, that cautionary note is typically missing.
Do you have maybe scientific interests outside psychometrics and psychology? Are you interested in other fields? Nothing specific. I like to read about developments in medical science, but most people do I think. No physics envy? No, physics is beyond me. Many people often admire their research; it’s very clean, and they work with laws, and they can be quite certain about causal effects, for example. Many psychologists and psychometricians are maybe envious of that, because we might never be able to achieve that or at least we should work very hard to maybe achieve that at one point. Do you suffer from that at all? No. In the work I did, I was very often working with proofs, and when the proof is done and the reviewers agree that the proof is correct, then that’s it, and I’m completely happy with that. Which proof was most influential? My most cited proof is a paper on orthogonal Procrustes rotation, from 1977.⁴ It was an appendix to my thesis. I just looked it up; I wasn’t aware that it was actually my most cited paper. People were interested in that because it’s a technique that is often used both in psychology and in areas outside of psychology, like biology. Is there maybe work you’ve done that you think should’ve attracted more attention? Work that I feel was ignored? Yes! That was my work on the communality problem in factor analysis. The communality problem is the problem of separating shared variance from unique variance. It helps in finding factors that determine the correlations between variables. Component analysis is focused both on the variance in the variables and the correlation between the variables. When you’re only interested in what brings about the correlations, you need communalities. I think Henk Kiers and I solved the classical communality problem,⁵ but it hasn’t been picked up much. I’m surprised that it wasn’t picked up, but probably, again, the problem was that you needed special software to run it.
You need to run an iterative program, and that is not generally available. I tried to get it in SPSS, but they were not interested.
4 Ten Berge, J. M. F. (1977). Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42, 267–276.
5 Ten Berge, J. M. F., & Kiers, H. A. L. (1988). Proper communality estimates minimizing the sum of the smallest eigenvalues for a covariance matrix. In M. G. H. Jansen & W. H. van Schuur (Eds.), The many faces of multivariate analysis. Proceedings of the SMABS-88 Conference in Groningen (pp. 30–37). Groningen: RION.
Would you propose that psychology education should involve more training in psychometrics so that psychologists know how to handle these problems and use the right software? You mentioned twice that something wasn’t picked up because it was complicated to apply. This has nothing to do with the training of psychology students. This has to do with the preferences of psychometricians. I think that most psychometricians, and most psychometricians working in factor analysis, prefer a statistical test of a model. They prefer to reject or accept that model, to adjust it, and test it again. The solution for the classical communality problem that Henk Kiers and I developed doesn’t involve a statistical test. There is no rejecting a null hypothesis: our solution explores, and that’s still less popular than hypothesis testing. Do you think that that is an attitude that should be changed, because exploring might be just as interesting as testing? I think that attitude is changing. How is psychometrics changing? Over the last 40 years, psychometrics has become more and more complicated, which means that it has become increasingly inaccessible to the ordinary psychologist. That is a problem, and the solution to this problem is to hire a psychometrician to interpret the methods for those people who need to use them. The solution is not to improve the statistical or psychometric training of psychologists. No, that wouldn’t help. And that’s why you think that psychometrics will always have a future because you’ll always need those people. I think so, yes. When you look at the history of psychometrics, what was the biggest achievement for psychometrics? The notion that scores contain a part that you can call error, which is at least 100 years old. The very simple fact that you don’t get the same result when you measure someone’s intelligence twice means that at least one of the two measurements cannot be correct.
The existence of measurement error cannot be denied, and taking error into account has been one of the major accomplishments of psychometrics. It is a very interesting fact that psychologists have a routine of evaluating their measurements, for instance, by reliability and validity studies. It’s a form of self-criticism that isn’t appreciated often enough. It’s a very beautiful situation: a discipline that distrusts its own results. That’s basically the attitude of science. Distrust what you find, what you produce. Rather than distrusting what other people do, start with your own research, your own work. It’s a major contribution to the world. When you think of the influence of psychometrics on society, for example, what is its biggest achievement in society? Probably the measurement of intelligence, which is applied psychometrics of course. Looking forward, what do you think is psychometrics’ biggest challenge? The problem could be that universities stop funding psychometrics, which could be a real problem. There has been a tendency in that direction in the United States for a few decades, and it may still be going on. So maybe the challenge is that psychometrics should survive cuts. Why were they cutting down on psychometric research? They probably thought there was already enough psychometrics. Regardless, you need psychometricians to help you analyze your data. The psychometricians need to have their own research topics, evaluating the methods that we have, and that will go on forever. But strictly speaking I don’t see much future for developing completely new methods because there’s so much around that needs calibrating, validating, and improving. The task of psychometrics has more to do with improving the methods that we have, helping and supporting psychological research, rather than inventing … … inventing things from scratch. Yes, there is a law of diminishing returns, which applies here too. Psychometrics has had its prime, is that what you’re saying? Yes, there has been an explosion of psychometric methods in the past, and the future will show fewer new methods and more examining of the methods that we have rather than developing completely new ones. Is that a shame, or is that ok? No, that’s perfectly alright. That’s how it should be.
It’s no disaster when for the next century psychometricians focus on comparing the methods that we have. Occasionally they will find a small change which is an improvement, and then you can call that a new method. In that sense, new methods will always arise.
Is there another scientific area that psychometrics can learn from as it were? Psychometrics does not exist, psychometricians exist. That you’ll have to explain. Psychometrics is just a word; psychometricians are actual living people. Psychometricians learn from their colleagues, like econometricians and chemometricians. They go to the same conferences, and that has been taking place for decades.
Chapter 20
Klaas Sijtsma
“After all these years, I have concluded that psychology should be leading for what we do.” Klaas Sijtsma is emeritus professor of psychometrics at Tilburg University. He was appointed University Provost of Tilburg University as of September 1, 2019. He was president of the Psychometric Society in 2010. Sijtsma earned his Ph.D. at the University of Groningen in 1988 under the supervision of Ivo Molenaar. His main research interest is the measurement of individual differences on psychological attributes.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_20
Klaas Sijtsma, thank you for your participation in this oral history project on the history of psychometrics. In this interview, I will be asking you questions about your career as a psychometrician, the relationship between psychology and psychometrics and other disciplines, and finally your view on the history and future of psychometrics. I always start with the question: How did you end up in psychometrics? How did I end up in psychometrics? I started as a student of pharmacy at the University of Groningen in 1974, but pharmacy could not really interest me a lot because it was totally different from what I expected. It turned out to be mainly chemistry, and I expected much more biology and other topics. At the age of 18, you need fast results, and you do not have the patience to wait very long until the study of your choice becomes more attractive. After 8 or 9 months, I decided to stop and find another topic as my major, and that became psychology. However, it turned out once more that I was still just a young person in search of a purpose, and like pharmacy, psychology could not really interest me from the start. I missed something that gave me a little more grip, something that gave me a little more foundation for what I was doing. That actually came in the second year of my study when I studied the textbook by Pieter Drenth, Introduction to Test Theory.¹ I had to read that book for an obligatory course, and while reading, I remember I was completely surprised by learning that psychologists actually measured attributes, and this was exactly what I had done when I was a pharmacy student for a whole year: I measured attributes. In what sense? Well, I measured the concentration of particular molecules in compositions of fluids, for example. This is important for pharmaceutical problems. I also measured electrical currents and the radioactivity of materials.
I had to practice these measurement procedures for a whole year, as they were part of the lab experiments we had to do. So, we learned how to do measurement and also to assess the inaccuracy of measurements. We would read measurement values from a scale, like a thermometer, or a value that was indicated by a needle on a clocklike instrument like an ammeter. A needle always trembles a little bit, so what you had to do is give your best guess of the measurement value and assess the inaccuracy of that value. What would be a lower bound for the measurement value you read off the scale, and what would be the upper bound? Then you would do the calculations for those three values (the target value, the lower bound, and the upper bound), and that meant that for all the computations you had to do to get a result for your experiment, you actually used an empirically established confidence interval. The term “confidence interval” was never used of course; this was simply how pharmacists did their research.
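The three-value procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `propagate_reading`, the ammeter readings, the resistance value, and the use of Ohm's law as the calculation are all invented for the sketch and do not come from the interview.

```python
def propagate_reading(calc, best, lower, upper):
    """Run the same calculation on the best guess and on both bounds.

    Hypothetical helper. It assumes `calc` is monotone in the reading,
    so the bounds of the input map onto the bounds of the result.
    """
    best_result = calc(best)
    # A decreasing calculation would swap the two bounds, so sort them.
    low, high = sorted((calc(lower), calc(upper)))
    return best_result, low, high


# Assumed example: the needle trembles between 0.48 A and 0.52 A.
RESISTANCE_OHM = 10.0


def voltage(current_ampere):
    # Ohm's law, V = I * R, used here only as an example calculation.
    return current_ampere * RESISTANCE_OHM


best_v, low_v, high_v = propagate_reading(voltage, best=0.50, lower=0.48, upper=0.52)
print(f"voltage = {best_v:.1f} V, between {low_v:.1f} V and {high_v:.1f} V")
```

Carrying all three values through every step is what yields the informal interval around the final result; no sampling theory is involved, which is presumably why the term "confidence interval" never came up in the lab.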
1 Drenth, P. J. D. (1975). Inleiding in de testtheorie. Deventer: Van Loghum Slaterus.
When I read Drenth’s book, I was completely surprised that psychologists also measured things, attributes, though it was a little bit different from what I learned in pharmacy: You did not read anything from a scale or try to pinpoint a needle, but you counted, for example, the number of correct answers. Now, that is an objective quantity. Of course, you can make an error counting, but if somebody answers 21 questions correctly, that is an objective truth. You cannot estimate a lower bound and an upper bound observation, because there is only one observation, and that observation is objective. I never expected that psychology actually entertained the idea that there is uncertainty in measurement and then came up with mathematical models to estimate that uncertainty. I am thinking of reliability, standard error of measurement, and so on. This fascinated me and inspired me a few years later to choose two different master programs. One was personality psychology, which is the study of individual differences, and the other was statistics and measurement theory. This was the perfect combination to learn a lot about how psychologists do their measurements. So, I finally found a topic that fascinated me, and actually, pharmacy played a role in that. Not only as a “bad” choice I made when I was only 18 years old but also because I did something there that I did not care a lot about at the time, namely measurement, but that I recognized later in a different field, done very differently but according to the same principles, and that fascinated me. And that’s how I landed in psychometrics. You studied in Groningen, right? And you did your dissertation in Groningen as well, if I am correct. What was that about? That was about Mokken scale analysis.² At the time, I worked at the Vrije Universiteit in Amsterdam, coincidentally, with Pieter Drenth.
As I said, he and his book stimulated me to become interested in measurement problems, and a couple of years later he became my boss: I did not look for it, he simply offered me a good job, so I went to the VU. At the time, and I think it is still true today, but I am not sure, the Vrije Universiteit did not have a department of methodology and statistics or psychometrics or psychological methods. So I asked permission from Pieter to find a supervisor someplace else, and he was comfortable with that. He gave me all the freedom I desired. In a way he was a great supervisor, because he simply allowed me to find my own way and do whatever was good to produce the dissertation I aspired to write. So, I approached Ivo Molenaar at the University of Groningen, whom I already knew from the time I was a student at his Department of Statistics and Measurement Theory. Ivo previously had introduced me to the work of Rob Mokken, who was an old study friend of his and whom he admired, and I proposed to prepare a dissertation about Mokken scale analysis. Not unexpectedly, he thought it was a great idea, so that is how it all started. That is also why I decided to defend my dissertation at the University of Groningen and not at the Vrije Universiteit.
2 Sijtsma, K. (1988). Contributions to Mokken’s nonparametric item response theory. University of Groningen.
Because of Ivo? Ivo was the main supervisor, and we all thought it was reasonable that the defense would take place in Groningen. It was great that Pieter also allowed me this freedom; I don’t think this would be possible today. Can you explain what Mokken scale analysis entails? Mokken scale analysis is a scaling procedure based on two item response models. As you know, item response models are measurement models, particularly suited for the measurement of psychological attributes. The two item response models are based on rather weak assumptions compared to other IRT models, but not so weak that you cannot do anything practical with the models. Rather than a metric scale, the models imply an ordinal scale of measurement, for both dichotomous items and polytomous items, though a little bit differently for polytomous items. What is great about these models is that if they fit the data, they imply an ordinal scale for person and item measurement. This means that if you use persons’ simple sum scores on the items to order the persons, then, by implication, you have also ordered them on the latent variable, though you have never been able to estimate the latent variable. This is because the models’ likelihoods do not contain the person (and item) parameters; the item response functions are only ordinally restricted, and that is all you know about them. This is a very neat result, because the problem in psychology actually is, or was at the time but still is I believe, that there is very little substantive theory that supports what a particular attribute is. Theories about intelligence and personality are very generally formulated. Very few crucial experiments exist, to my knowledge, which support a particular structure for those attributes. Because we have so little knowledge of these attributes, it does not make a lot of sense to have measurement models that imply many restrictions on your data.
Because there is not a theory that tells you what the structure of the attribute is, you do not know well how to operationalize the attribute, that is, how to translate the attribute theory into a number of items that you can use to extract the responses from people in such a way that they tell you something about the attribute. The connections are weak, so you cannot make precise predictions about what to expect. And if you do not know what structure to expect in your data, it is not a good idea to have restrictive models, because these models imply a structure that you usually will not find in your data. Then, it is often better to have a weak model, which allows a lot of leeway in your data, but still imposes enough structure that, if present in the data, implies an ordinal scale. Is Mokken scale analysis still one of your main research topics? I still do research in that area, but Ph.D. students do most of the research. That has been the case for a long time now, though I also do some things myself. I worked together with a couple of colleagues, who have become very interested in the topic. Quite recently, Andries van der Ark and I wrote a paper for the British Journal of Mathematical and Statistical Psychology, which is meant as a tutorial for Mokken
scale analysis.³ It is yet to be published, but it has been accepted, and perhaps it is somewhere on the web. So, I am still active in Mokken scale analysis. I think it is a very nice technique, and it goes very well with the poor state of theory in psychology. Do you still use restrictive models in your research? I do, because restrictive models have one advantage. They never fit the data, and the misfit often is gruesome (that is, if you look with precision at fit issues, which I will come back to), but anything that goes wrong can also be very informative. You can learn a lot from a model that fails to fit the data: Understanding why a model does not fit provides you with a lot of useful information. A medical doctor learns more from a sick patient than from a healthy patient. As soon as something goes wrong in the system, a lot of information comes out. You said you wanted to say something about those fit measures. Is that an important other tradition in your research? This is a terribly important issue, but it is also a topic that has been neglected in general. What we have seen over the past decades is that statistical models have become more and more complex. Nowadays, models are much better able to describe complex data structures than previously, but this is hazardous when the data are not collected on the basis of a substantive theory. If the data contain both a lot of noise and many signals but you do not know what the signals mean, and you use very complex models to try to describe those signals, you still do not know what you’re really looking at. The goodness of fit problem is extremely important: If you use a model to describe your data, you have to know whether the model actually fits the data, whether the structure of the model is consistent with the structure of the data, and if not, why this is the case.
If you look closely at the literature for many of the complex models, you will find that the goodness of fit methods for those models have often been poorly developed or not developed at all. What frequently happens is that people spend a lot of time developing models and estimation procedures for models, but do not spend as much time developing goodness of fit methods. Estimating the parameters of a model has recently become known as “fitting a model”; nowadays, people often say, “I have fitted a model to my data.” What they actually mean is, “I have estimated the parameters of my model.” “Fitted” means something else; it means, “I have investigated whether the structure of my model is consistent with the structure of the data.” If the model does not fit, the robustness problem comes up: If my model does not give a good description of my data, how seriously can I take the estimates of the parameters? Are they worth anything? This is complex research. For some models, these issues received a lot of attention, for other models much less, and that is a serious problem.
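The distinction drawn above between estimating parameters and investigating fit can be illustrated with a deliberately simple, non-psychometric sketch. Everything in it is an assumption made for illustration: the toy data, the straight-line model, and the helper names `estimate_line` and `residuals`. Fit assessment for IRT models is far more involved; the sketch only shows that estimation returns numbers whether or not the model's structure matches the data.

```python
def estimate_line(xs, ys):
    """Least-squares estimates of intercept and slope for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus model-implied values: the raw material for a fit check."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Toy data with a curved structure (y = x squared), invented for illustration.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

# Step 1: estimation always "works" and returns parameter values.
intercept, slope = estimate_line(xs, ys)

# Step 2: the fit check asks whether the data have the structure the model implies.
res = residuals(xs, ys, intercept, slope)
signs = ["+" if r > 1e-9 else "-" if r < -1e-9 else "0" for r in res]

print(f"estimates: intercept = {intercept:.1f}, slope = {slope:.1f}")
print("residual signs:", " ".join(signs))
```

The estimates come out cleanly, yet the residuals are positive at both ends and negative in the middle. That systematic pattern is the kind of evidence a fit investigation looks for, and it raises the robustness question from the text: how much the slope is worth when the straight-line structure is wrong.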
3 Sijtsma, K., & Van der Ark, L. A. (2017). A tutorial on how to do a Mokken scale analysis on your test and questionnaire data. British Journal of Mathematical and Statistical Psychology, 70, 137–158.
Do you think that when psychological theories become stronger, using more restrictive models makes sense? It depends on the theory and the predictions the theory makes. I think a wonderful example of a theory about attribute structure is formalized by the network research done presently by Denny Borsboom and Angelique Cramer at the University of Amsterdam. They have proposed a formal approach to describing particular traits, like depression, using networks, and this is something completely different from, for example, assuming an IRT model. They have many messages, but for someone like me, the main message is, “There is a theory that tells us something about a certain type of trait, which is that the structure of these attributes is not consistent with the structure of an IRT model.” This shows that the substantive theory is leading with regard to the measurement model you have to use. And there are more examples. Take, for instance, the balance-scale problem for proportional reasoning, which I think Siegler proposed in the 1980s.⁴ Han van der Maas and Brenda Jansen, as well as myself and a couple of colleagues, have investigated balance-scale problems, and in different ways, we found that you have to use different latent classes, not continuous latent variables, to understand proportional reasoning. At the time when we did our research, which was 1989, latent class analysis existed but still had to be developed to be estimable, so we used a type of cluster analysis to find out about the structure in the data.⁵ A couple of years later, Han and Brenda tried to use latent class analysis, which still was not fully developed at the time, but they were more successful than we were.⁶ The actual outcome of that research was that it does not make sense to measure proportional reasoning on a continuous scale.
Developmental theory predicts children use various strategies to solve balance-scale problems, and the four or five clusters we found and the latent classes Han and Brenda found later match these strategies very well. These are developmental classes, which are not ordered, so you have a nominal scale of classes, and measurement of proportional reasoning therefore has to be done on a nominal scale.
4 Siegler, R. S., & Vago, S. (1978). The development of a proportionality concept: Judging relative fullness. Journal of Experimental Child Psychology, 25, 371–395.
5 Van Maanen, L., Been, P., & Sijtsma, K. (1989). The linear logistic test model and heterogeneity of cognitive strategies. In E. E. Roskam (Ed.), Mathematical psychology in progress: Recent research in psychology. Berlin: Springer.
6 Jansen, B. R. J., & Van der Maas, H. L. J. (1997). Statistical test of the rule assessment methodology by latent class analysis. Developmental Review, 17, 321–357.
So would you say in general you would encourage that approach? Starting out with theory and, based on that, choosing a specific measurement model, unless the theory is too weak and you need less restrictive models like Mokken models. In general, I would. I am now in a phase of my career where I would encourage everybody to engage in theory building, to become a psychologist rather than a psychometrician. We have a huge toolkit of formal models, and of course we have to continue
developing these models and new ones, but I think the main message I would like to give to everybody is: To find sensible applications for the measurement models we developed, we now have to focus on developing sound psychological theories of intelligence, personality traits, and attitudes. These theories will tell us which models to use in particular cases. After all these years, I have concluded that psychology should be leading for what we do. Otherwise, we are simply developing mathematical structures without an empirical foundation. Of course, that is a fun thing to do and like many colleagues I have also made a career out of that, but if we think it is important to have good applications for our models, we might actually start spending time building theories. Do you consider yourself a psychologist? Unfortunately not. It is too late for me I think. You might also see this as an appeal to the younger generation, to be both a psychologist and a psychometrician, because I think the combination works best. If you are only a psychologist, you will not be able to meaningfully use all those mathematical and statistical models, because they are simply too difficult to understand or apply in a sensible way. A psychologist should know a lot about statistics and psychometrics but also know about psychological theory. I think history has shown that if you focus too much on psychometrics only, and forget about what the purpose of these models is—with which goal in mind you have developed these models—it does not often lead to useful results. And that is not a very original thought. Charles Spearman and also Lee Cronbach were good examples of early colleagues expressing similar thoughts, but over the past decades, psychometrics and psychology have grown apart. Why has that movement of drifting away from psychology taken place in the first place? That is difficult to say. Perhaps we need to take more distance in time to understand what has happened. 
One of the things is that if you are interested in formal models, in mathematics, in statistics, in methodology, you actually want to focus on those topics. That is what makes you tick, not building psychological theories. So all the time, a scientific community like psychometrics pushes these models to higher, more complex levels, which are more mathematically involved. Computer science also plays an increasingly important role. Computers have been developed at such a rapid pace, and they have become so extremely powerful that you cannot ignore doing something with them. Many problems we originally could not solve because they were too complex computationally can now be solved because we have these very powerful computers. Computers have pushed psychometrics to an even more complex technical level, away from psychology, and simultaneously, psychology did not move in the same direction. Psychology did not become more quantitative than it already was. Actually, psychology remained what it was, and much of it moved in the direction of neuroscience, which is a different direction. The two fields, psychology and psychometrics, grew apart for different reasons. We have the
technology, but we do not have the theory, and in order to have a sensible application for what we do, psychology and psychometrics must come together again and work together closely. This thought of yours, that a psychometrician should also be a psychologist, is something you developed over time. Does that imply that earlier in your career you were one of those classical psychometricians that did only technical work? Yes, I was! At one time in my career, I loved to tackle complex mathematical problems, even if I could not solve them. It simply gave me a lot of joy, and I have seen the same quality in many colleagues. There is nothing wrong with it: If you find something that you like, why not do it? If this is what gives you energy and motivates you, you would be silly not to embrace it. But there comes a time when you have to ask yourself: “What shall we do with all this work? Does it actually have a purpose?” It takes a couple of years to find that out. It is not a shocking new insight, as I said. Are psychologists now trying to merge again with the psychometricians? Many of my colleagues work together with applied researchers, not only psychologists but also researchers from medicine, health research, and marketing. So, in many areas psychometricians work together with substantive researchers, which is good. It would be great if we could take that a little bit further than simply advising researchers on which data analysis method to use. Consulting is extremely important because statistics is so difficult, and you cannot expect researchers to apply all those different techniques in the right way. It is a good thing that my colleagues are prepared to help researchers to do statistics in the right way. However, an additional step would be to engage in theory development. I would like to see theory development for psychological attributes, because I believe that good research starts with good measurement, but good measurement needs good theories.
And are you now involved in such a project? Unfortunately not. I am the dean now of this school,7 and although I try to be active as a psychometrician and as a supervisor of Ph.D. students, I am sorry to say I have very little time left.
7 As of September 1, 2019, Klaas Sijtsma is ‘rector magnificus’ or university provost of Tilburg University.
Besides Mokken scale analysis, what other topics have you worked on in your career? Over the past couple of years, I have become interested again after a very long time in classical test theory and everything related to it. Recently, I’ve been chair of the COTAN, the Commission of Test Assessment in the Netherlands, which is a quality
assessment committee for tests and questionnaires. Those are tests and questionnaires used in applications like personnel selection and clinical diagnosis, and the measurements are used to derive a specific diagnosis about an individual. It took me some time to realize that test researchers, test constructors, and test users have very little awareness of what psychometricians do. They have their own methods that they use. And what they use for constructing tests is classical test theory, for example, coefficient alpha. They usually have no knowledge that there are other methods for estimating test-score reliability. They also use principal component analysis, sometimes factor analysis, and a couple of other techniques. Actually, I think they use a very sensible toolkit, but in the eyes of the average psychometrician, their methods are hopelessly outdated. I am not sure whether that is true, but that is the perspective of the psychometrician who does all this highbrow work every day. They work on a completely different level than people who construct and use tests. So, I became much more aware of what the methodological and psychometric problems are of people who do the practical research and construct tests. This inspired me to become interested again in topics like reliability estimation, validity, norming methods, and test length issues. Nowadays, there is a tendency in many applied fields to use extremely short tests that contain only three, four, or five items. From a psychometric point of view, this is not a good thing to do, because short tests introduce a low reliability, but in practice people have good reasons to use short tests. They will tell you that they have to test patients, people who are not feeling well, and for them, being tested is a burden. You do not want to bother patients too long, so you present them with a few items rather than a whole test battery. I thought it was a nice topic to study the usefulness of short tests. One of my Ph.D. 
students, Peter Kruyen,8 wrote a Ph.D. thesis on this topic, and over the past couple of years, other classical test theory topics have been picked up by other Ph.D. students. Pieter Oosterwijk9 recently defended a dissertation on reliability estimation methods, Ruslan Jabrayilov10 defended a dissertation on change-measurement issues, and Hannah Oosterhuis11 will defend her dissertation on regression-based norming next year. All these topics were inspired by my time as chair of COTAN, and studying them is a nice challenge and a way to do something for the people who construct and use tests.
8 Kruyen, P. M. (2012). Using short tests and questionnaires for making decisions about individuals. When is short too short? Tilburg University.
9 Oosterwijk, P. (2016). Statistical properties and practical use of classical test-score reliability methods. Tilburg University.
10 Jabrayilov, R. (2016). Improving individual change assessment in clinical, medical and health psychology. Tilburg University.
11 Oosterhuis, H. E. M. (2017). Regression-based norming for psychological tests and questionnaires. Tilburg University.
Do psychologists pick up on your methods? I have no idea. I know they read some of the work we publish. Now and then, I try to publish something about it in De Psycholoog,12 because this is an outlet they might read. I hope what we do is useful to them. Actually, now that we are talking about this, I think we should have a little marketing department that spreads the word; that would be more effective. Only doing the research, publishing the research, and presenting a paper now and then really will not help to get these new results across. We need to be more active, spreading the word so to speak. Do you think it is easier to sell improved classical test theory methods than modern test theory to applied psychologists? That is difficult to say. Some difficult techniques are sometimes suddenly picked up. Could anyone have predicted 20 or 25 years ago that multilevel analysis would become such a popular tool among social scientists? And multilevel analysis is quite complex; it is far from easy to understand what is going on and how to specify the right model. But it addresses a problem that many researchers recognize. I think that is important: the recognition of a problem. You see someone else struggling with a problem, or you see someone presenting a technique for solving a problem that you have been struggling with yourself for many years, and that makes you curious about using that technique. How do the psychometricians respond to the fact that you are working on classical test theory rather than modern test theory? In general, I think they appreciate it. I hope so! No hate mail … No, no! The psychometrics community is a very pleasant community of people who are very friendly and helpful to one another. I am not saying this because I am on camera now. It is really a nice group to be part of. When you look at your career, is there a psychometrician that has really inspired you? Many people have been very inspiring. I will only mention two or three.
Charles Spearman is a great source of inspiration for all of us. Reading his papers from around 1900 or 1910 really is very joyful; it is very good work actually, it still is.
12 De Psycholoog is a monthly Dutch magazine on psychology, published by the Netherlands Institute for Psychologists (NIP) and aimed at practicing psychologists.
Because he combines psychology with psychometrics? Yes, and because he was so revolutionary. He actually combined psychological problems he was struggling with, with the development of statistical tools that he needed to tackle those problems, and in a way, he is the founding father of classical test theory and factor analysis. That is not a small accomplishment; actually, it is incredible. I think Fred Lord is another very important psychometrician. He lived much later than Spearman (he was active between the late 1940s and the early 1980s), and he published quite a few important papers about topics in classical test theory and item response theory. He was one of the founding fathers of item response theory; his Ph.D. thesis from 1952 was about item response theory. Lord also worked together with Mel Novick and raised classical test theory and psychometrics to a higher mathematical level. What Lord and Novick did was provide a mathematical foundation for classical test theory, which resolved many logical problems and discussions that took place in the decades before them. Nowadays, it is well known that coefficient alpha is a lower bound to reliability (it is just a mathematical theorem that you can prove), but that is only thanks to people like Mel Novick and Fred Lord and, in the specific case of the alpha theorem, Charlie Lewis as well. If you read the early paper by Lee Cronbach from 1951 about coefficient alpha, in which he describes the state of knowledge as it existed then, it is clear that whether alpha was or was not a lower bound to reliability was still under debate. The reason for that was that there was not a sound mathematical description of classical test theory yet. There were different versions of classical test theory around and, depending on the version you liked, alpha was or was not a lower bound.
Lord and Novick provided a sound mathematical basis so that we all knew what we were talking about, not only with regard to reliability but also for many topics. Lee Cronbach was also very important for me. Not so much from a mathematical point of view, but I think he had many good ideas; he was a source of inspiration for many people. Coefficient alpha, which can be criticized in many ways, resolved many problems surrounding reliability theory at the time. It was a great contribution actually. He also wrote several papers about construct validity together with Paul Meehl. He was the founding father of generalizability theory, which is not applied a lot but, as a thought model, it is simply a great accomplishment. He was also very active in educational psychology and published quite a bit about decision-making using test scores. He was active in many different fields and made important contributions in all those fields. What do you think is your own most influential work? I have an answer to this question, but it does not concern a scientific accomplishment. In the late 1980s, Pieter Drenth asked me whether I was prepared to write a new version of his textbook on test theory, the one that drew me into psychometrics some 12 years earlier. This book was a relatively simple introduction to test theory and was used in many psychology-training programs in the Netherlands and
Belgium. I agreed, and together we published a new edition in 1990,13 which then was the third edition of the book. We also published a completely revised fourth edition in 2006,14 and I know that at least 30,000 students have studied test theory from that book. I can be humble about that, but I think it is a huge number in such a small language area. It is not an original scientific work but an educational textbook, and I am proud of that. If you look at psychometrics as a whole, what do you think is its biggest achievement? The biggest achievement of psychometrics is to provide a good formal foundation for test construction, so that psychologists and people in educational measurement are well equipped to construct tests and questionnaires. Even though the substantive basis is not always sound, the technology that we have for test construction is very good. You could say that, because psychometrics comes out of psychology, psychology has provided itself with the means to construct good tests and questionnaires, which are now applied in most Western countries on a very large scale. Psychological and educational measurement has had and still has a huge societal impact. If you follow the discussions in the Netherlands each year, about the Cito tests, for example, you immediately understand how important testing has become in society. And I think this is our biggest contribution: here we matter. Here we really matter for society. You said it would be wise if, in the future, psychometricians became psychologists as well, not only psychometricians. Is there another challenge that you foresee? Another challenge is to resist becoming statisticians and computer scientists. There is a lot we can learn from statistics and computer science, and we should never stand with our backs to those fields. We should embrace them, and I hope they embrace us.
But we should never forget that our unique contribution is in the social and behavioral sciences and especially in psychological and educational testing. There is a tendency among some of us to become more of a statistician or a computer scientist. That is fine, everyone is entitled to have his or her own personal ambition, and I have nothing to say about that. If you are happy being a computer scientist or statistician, please be one. But that is not where our unique contribution is. Statistics and computer science were already there when we were not very much interested in them, so they really do not need us; we have a different aim.
13 Drenth, P. J. D., & Sijtsma, K. (1990). Testtheorie. Houten: Bohn Stafleu van Loghum.
14 Drenth, P. J. D., & Sijtsma, K. (2006). Testtheorie. Inleiding in de theorie van de psychologische test en zijn toepassingen (4e herziene druk) [Test theory. Introduction to the theory and application of psychological tests (4th revised ed.)]. Houten, The Netherlands: Bohn Stafleu van Loghum.
What are your personal plans for the future? I’ll be dean for another year. You never know how life goes, but I expect that I will not be a dean by September 1st next year. Then, I can take up an old project that I had to let go of when I became dean 5 years ago. I could not see that one coming, but all of a sudden, I was dean. Why was that? Because of the data fraud committed by Diederik Stapel, in this school. Because Stapel was also the dean of our school, when his integrity breach came out, the school needed another dean, and the University Executive Board asked me to step in. Let us leave that for what it is; numerous things have been said about the topic, and that is enough as far as I am concerned. But before that happened, Andries van der Ark and I signed a contract with a publisher to write a book about the use of item response theory and other measurement models in practical data problems. We wanted to write a book that makes sense to people who construct tests and questionnaires, introduce the more complex techniques, and show how to use those in a responsible and sensible way. We wrote about 5% of the book, and then I had to postpone writing for 5 years, and it will be 6 years next year. I’ve been promised a sabbatical for 1 year, and then we will write the book. That’s the plan.15
15 Sijtsma, K., & Van der Ark, L. A. (2020). Measurement models for psychological attributes. Boca Raton, FL: Chapman & Hall/CRC.
Chapter 21
Hua-Hua Chang
“The biggest achievement of psychometrics is that it provides powerful tools to guarantee that data are collected and analyzed in a meaningful way.”
Hua-Hua Chang is professor of educational psychology at Purdue University. He served as president of the Psychometric Society in 2012. Hua-Hua earned his Ph.D. at the University of Illinois Urbana-Champaign in 1992 under the supervision of Bill Stout. His main research interests are computer adaptive testing, differential item functioning, and asymptotic theory of IRT.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_21
Thank you for your participation in this oral history project on the history of psychometrics. I’ll be asking questions on three different topics. One is your career as a psychometrician, another is the relation between psychology and psychometrics, and finally your view on the history and the future of psychometrics. And I always start with the question of how you ended up in psychometrics. That’s because I studied statistics. I received a Ph.D. in statistics in 1992, from the University of Illinois at Urbana-Champaign. There is a tradition at UIUC that many people who are majoring in statistics, psychology, or education are involved in psychometrics. What did you do during your Ph.D.? In my dissertation, I tried to solve a problem, called asymptotic posterior normality, which was proposed by Paul Holland.1 The goal was to prove that the posterior distribution of the latent trait, given the test responses, converges to a normal distribution for item response theory models under very general and nonrestrictive nonparametric assumptions; that was a typical statistical problem generated from psychometric applications. So, I solved the problem and published the paper,2 which has now been cited more than 100 times. After my graduation I landed a job at the Educational Testing Service. I worked at ETS for 6 years, which was a good opportunity for me to learn more about psychometrics and identify challenging problems emerging from testing. What was it about testing or psychometrics that you found interesting? Now that we’re testing so heavily, we need to make sure the tests we use are reliable and valid. Tests have a substantial influence on the functioning of society by affecting how people are selected, classified, and diagnosed; psychometric research will lead to better testing and hence benefit society. You didn’t want to stay in statistics? Actually, I’d say I’m a psychometrician now.
The journals I usually send my papers to are Psychometrika, Applied Psychological Measurement, the Journal of Educational Measurement, the Journal of Educational and Behavioral Statistics, and the British Journal of Mathematical and Statistical Psychology. However, statistics and psychometrics are so closely related. I was selected as a Fellow by the American Statistical Association for my contribution to educational measurement.
1 Holland, P. W. (1990). The Dutch identity: A new tool for the study of item response theory models. Psychometrika, 55, 5–18.
2 Chang, H.-H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37–52.
Your Ph.D. advisor was Bill Stout, right? Yes, Dr. Stout served as president of the Society, and he has two former students who also served as president, Brian Junker and me. Because psychometrics is so closely related to statistics, he taught me how to work with statistical components in psychometric foundations. We need to establish sound statistical methodologies for psychometric applications. My dissertation was half and half: statistical and psychometric. At Urbana-Champaign, I had the opportunity to attend so many seminars from education, psychology, sociology, and statistics, and I learned a whole lot from people who do assessments and testing. I think I was partially a psychometrician after my dissertation, but after 6 years working at ETS, I became 100% a psychometrician. So, where did you go after ETS? I went to the National Board of Medical Examiners, located in Philadelphia. I learned many new technologies, in particular game-based assessment, which is something just like a computer game designed to assess students’ clinical skills. We provided students with computer-simulated patients so that they had the opportunity to work in “real-world” situations. Such assessment is totally different from multiple-choice tests. I’m a practitioner-turned-professor; I worked 9 years in the testing industry before moving to academia. What was the first university you worked for? In 1997 I took a leave of absence from ETS and worked at the Chinese University of Hong Kong for 1 year. I taught educational measurement there, and I also collaborated with Dr. Kit-Tai Hau and published several papers. What is so special about the United States? Why didn’t you want to stay in Hong Kong? I decided to come back because my dream was to teach in the United States. The United States has more opportunities for psychometricians. I became associate professor at the University of Texas at Austin in 2001, and I stayed there for 4 years.
In 2005, I was recruited by my alma mater, the University of Illinois, as a hire for excellence, and I returned to Champaign. If you look back at the years that you worked in psychometrics, what do you consider your most important research topics? I think I made three significant contributions to the field. One is in computerized adaptive testing. I worked very extensively on item selection algorithms, and most of my papers in this area have been highly cited. There was one single paper published in
Applied Psychological Measurement, which has now been cited almost 400 times.3 The second field is called differential item functioning. When I was at ETS, we had written up some theory and algorithms on how to identify potentially biased items. We also tried to make the program user-friendly. One program we developed, called “Poly-SIBTEST,”4 has been used by many practitioners. Test fairness is so important; we try to make public tests fair by providing sound psychometric methods. The third one lies in my dedication to promoting rigorous research on psychometric problems, such as the asymptotic theory for IRT models and certain large-sample properties for computerized adaptive testing.5 I did many other things, but I consider these three the topics that people would remember, that the field will remember. Did you ever receive strong criticism on your work? Every time you try to propose a new method, you tend to receive criticism, but most criticism can eventually make your methods better. It is a good thing to receive criticism because it means you have others’ attention. Just listen to see whether what they say is reasonable; this gives you the opportunity to fine-tune your work. Did you ever think: “I’m going to do something completely different?” Or were you always happy to be in psychometrics? I love what I have been doing and would never want to switch fields. That’s because psychometrics is such a promising field, and it has strong connections to many other fields that require measurements, such as biometrics in biology, econometrics in economics, and behaviometrics in behavioral research. My interest is really in psychometrics. And do you also have an interest in psychology? I have a great interest in psychology. My appointment at UIUC is divided over two departments: psychology and educational psychology. The former belongs to the College of Liberal Arts and Sciences, and the latter belongs to the College of Education.
Psychometrics literally means psychological measurement. It was originally invented by psychologists with an intention to better measure all kinds of traits.
3 Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213–229.
4 Chang, H.-H., Mazzeo, J., & Roussos, L. (1996). Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure. Journal of Educational Measurement, 33, 333–353.
5 Chang, H.-H., & Ying, Z. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37(3), 1466–1488.
Can you explain the difference between quantitative psychology and psychometrics? Quantitative psychology is broader: besides psychometrics, it also includes other disciplines such as decision theory. Psychometrics is the methodology that deals with the design, administration, and interpretation of measurements of individuals’ constructs such as abilities, attitudes, personality, knowledge, quality of life, learning progress, and so on. Does computer adaptive testing have a practical application? In the old days CAT was used only to help testing companies, but now we’re trying to use CAT to help learning. At the University of Illinois, we developed a CAT-empowered smart assessment platform to reduce DFW (students earning a D, Fail, or Withdrawal) rates for a large college gateway STEM class. Funded by the National Science Foundation, our research indicated that CAT could deliver individualized and diagnostic assessments and effectively help students progress via their online activities.6 Do you think that CAT is an important contribution of psychometrics to society? Yes, I think it’s a very important contribution. CAT has a substantial influence on the functioning of society by affecting how people are selected, classified, and diagnosed. One of the newest developments is that CAT can be utilized to facilitate individualized learning. Nowadays many people are taking online courses, like MOOCs, but most online courses are not tailor-made to best fit each individual learner. Though a class can be created by videotaping an extremely experienced instructor, without adaptivity, certain groups would be left behind. Adaptive learning provides the opportunity for quick learners to learn faster, for students who need help to receive help, so that everybody can be successful.
As a recent TIME article titled “A is for Adaptive”7 explains: “It’s impossible to provide one-to-one teaching on a mass scale, but technology enables us to get closer than ever before.” Technology-powered adaptive learning is clearly a new trend in today’s education. So how long do you think it will take before every school makes use of this method? Schools all over the world are interested in Artificial Intelligence (AI)-powered smart learning. As one of the earliest AI applications in education, CAT will definitely play an important role to accelerate the journey to bring individualized learning to schools.
6 Morphew, J., Mestre, J., Kang, H., Chang, H.-H., & Fabry, G. (2018). Using computer adaptive testing to assess physics proficiency and improve exam performance. Physical Review Physics Education Research, DOI: 10.1103/PhysRevPhysEducRes.14.010127
7 Webley, K. (2013). A is for Adaptive. Time Magazine, 181, 40–45.
Are there limits to computerized adaptive testing? Is it applicable to all types of learning? I think adaptive learning is ideal for all types of learning. With the help of CAT technology, individualized learning routes can be tailored to each student based on his/her progress. So, psychometrics is not only helpful for psychology. There is a booming field in the United States called patient-reported outcome measurement. Conventional clinical measures of disease such as X-rays and lab results do not fully capture information about chronic diseases and how treatments affect patients. Today CAT is helping doctors to get information about patients, such as well-being, mobility, pain level, and so on, before the patients enter the clinic. So, psychometrics is helping medicine. You are confident that psychometrics has a future. Totally, with a great future. Psychometricians are in high demand, and all my former Ph.D. students had multiple job offers when they graduated. What would you say is the biggest psychometric achievement for science? The biggest achievement of psychometrics is that it provides powerful tools to guarantee that data are collected and analyzed in a meaningful way. Psychometrics is becoming more and more important in the big data era. As indicated by Dr. Xiao-Li Meng at Harvard, if we don’t take data quality into account, the bigger the data, the surer we fool ourselves. When you talk about how data should be collected and analyzed in a meaningful way, do you think that it’s also the job of the psychometrician to always collaborate with other applied researchers? Yes, it is important for psychometricians to collaborate with researchers from other disciplines so that they can understand the nature of the research and develop reliable and valid measurement instruments for the research. What do you think is your most influential work?
I believe it’s my presidential address at the 2013 IMPS meeting, titled “Psychometrics behind computerized adaptive testing.”8 Though it has not received a very high citation rate yet, it lays out my prediction, my vision for CAT.
8 Chang, H.-H. (2015). Psychometrics behind computerized adaptive testing. Psychometrika, 80, 1–20.
What is your vision? CAT technologies will rapidly transform testing from unaccommodating ranking measures into flexible and informative tools that can be used to address the compelling needs of various stakeholders in education. I also pointed out how we can creatively use CAT to help learning and to make learning more individualized. Who is a psychometrician that really inspired you? First of all, my thesis advisor, Bill Stout, inspired me the most. Also, I was inspired by Fred Lord through reading his books and papers. I admire both of them not only for their scholarly contributions but also for being late bloomers, becoming psychometricians in their 40s. What do you think is the biggest challenge for psychometrics? With the current rise of big data, AI, and deep learning, psychometricians are at a crossroads: whether to give up the existing psychometric methods and switch to the fashionable machine learning and deep learning. In fact, many powerful psychometric methods developed over the last 40 years are indeed part of AI applications but are only used by a handful of testing companies. So, it is important to accelerate the dissemination of these methods to let more people know what psychometricians can do to address their challenges. If you look at the divisions of sciences, people tend to stick to what they know, so you have the psychometricians, the biologists, and the physicists. Do you think they should collaborate more? Psychometrics is interdisciplinary. At Urbana-Champaign, we are collaborating with physics, statistics, psychology, sociology, education, and computer science. What are your personal plans for the coming years? I want to learn Python so that I can train my students how to develop web-based applications. I do not mean that we should give up on R; we can still use it, but we need to move to the next step. Is there a psychometric problem that you still want to solve?
Yes, I am interested in utilizing short-test-length sequential designs to detect learning. In an individualized learning environment, it is important to build many reliable short-length assessments to classify students’ mastery levels for any given set of cognitive skills that students need to master. One final question: is there something that you still want to learn, apart from web-based applications? I’d like to explore how to incorporate methods like deep learning and Bayesian networks into CAT.
Chapter 22
Themes and Visions
In this book, we met with 20 past presidents of the Psychometric Society, who share memories of their career in psychometrics, elaborate on what drives them to do their research, and reflect on the practice and contributions of psychometric research. Though each of these interviews presents one unique perspective and can be read as a stand-alone chapter, when woven together, the interviews show how diverse this niche scientific discipline is. Based on the interviews, we can distinguish several themes that the interviewees consider important in their work as a psychometrician. I will draw three distinctions that denote the diversity and the complexities of psychometric research: the substantive vs. the data-analytic, the theoretical vs. the practical, and the narrow vs. the broad. Lastly, I will discuss how our interviewees experience difficulties with reaching out to other research communities, such as statisticians and psychologists. Along the lines of these distinctions and themes, I will aim to summarize the ideas and visions of our psychometricians and especially point out where and why they stand on opposite sides of the argument.1 Together, these themes illustrate the ideas and visions of our interviewees about psychometrics as a scientific discipline and especially what drives our psychometricians in their work.
1 This chapter is a shortened version of Wijsen and Borsboom (2021). Perspectives on Psychometrics: Interviews with 20 Past Psychometric Society Presidents. Psychometrika, 86, 327–343.
© Springer Nature Switzerland AG 2023 L. D. Wijsen, Twenty Interviews With Psychometric Society Presidents, https://doi.org/10.1007/978-3-031-34858-7_22
The Substantive vs. the Data Analytic
Possibly one of the strongest sources of disagreement among the interviewees is the use and purpose of psychometric research, especially with regard to whether psychometrics should be more psychology-oriented or more statistics-oriented. In a more psychology-oriented psychometrics, psychometric models are considered tools to build substantive theories about human behavior rather than devices to give a statistically sound description of the data (the more statistics-oriented approach). Among our interviewees, there are a number of clear proponents of a psychometrics that is on the front line of psychological or educational research. Representatives of this line of thinking are people like Susan Embretson (Chap. 10), Klaas Sijtsma (Chap. 20), and Paul De Boeck (Chap. 17). According to Susan Embretson: There’s a new breed of psychometricians who seem to have less substantive background, and I do not think that is a good thing. I think they might be dealing with rather narrow statistical issues that are not really going to make a difference in the discipline or what is being applied in measurement. So, I really see a necessity to keep quantitative methods attached to a discipline so it can influence that discipline.
According to this perspective, psychometricians have knowledge and expertise that is unique and incredibly important for psychological and educational research, and either together with applied researchers or on their own, they can contribute to psychological science. Along similar lines, Klaas Sijtsma argues that: We have a huge toolkit of formal models, and of course we have to continue developing these models and new ones, but I think the main message I would like to give to everybody is: To find sensible applications for the measurement models we developed, we now have to focus on developing sound psychological theories of intelligence, personality traits, and attitudes.
The focus of psychometrics should thus lie not only on developing new, even more statistically advanced, models but also on finding good applications of these models and on how these models can be used to build theories on psychological attributes. According to Paul De Boeck, it is a shame that psychometricians are only employed by psychologists because of their expertise on psychological and educational measurement, and not because of their potential to contribute to understanding human behavior, whereas the latter is certainly an option. Rather than a sole focus on psychological measurement, De Boeck prefers the idea that measurement is a by-product that follows from true understanding of the phenomenon of interest. Defining the nature of, for example, a cognitive ability is incredibly complex, but through the employment of psychometric models and intelligent test design, one could increase our understanding of this ability. According to De Boeck, psychometric models are capable of contributing to psychological theory by increasing understanding of psychological constructs. At the same time, it is undeniable that psychometric models have become exportable products, applicable to a variety of data types, and, in that way, adaptable to all sorts of substantive interpretations (or no substantive interpretation at all). These models do not often specify a theory about a psychological mechanism but can be used for a variety of measurement problems in the social sciences. For example, though the common factor model was once explicitly designed for the measurement of a specific construct (general intelligence), it has been transported to a wide variety of psychological constructs (like personality and psychopathology) and is no longer only linked to intelligence per se. So even though psychometric models can indeed be used for building theories as De Boeck and Sijtsma claim,
psychometric models can also be considered statistical tools for data analysis of behavioral data. Following this line of thinking, psychometrics can be seen as a specific type of statistics that is (mostly) specialized in latent variable models without a link to a specific substantive domain. This idea is voiced by Brian Junker (Chap. 18): “In a certain sense, by definition psychometrics is tied to psychology, but the methods are really just the methods of latent variable modeling for individual differences, and those may or may not be tied to psychology.” Similarly, Paul Holland (Chap. 7) believes that “psychometrics has a very strong statistical side, I keep thinking of psychometrics as being part of statistics, the ‘metrics’, and not so much the ‘psycho’.” Perhaps, the connection with psychology is now subordinate to the connection with statistics. The relationship between psychometrics and statistics is visualized in Willem Heiser’s metaphor of the river system (Chap. 15). This metaphor describes an intricate system of subdisciplines feeding into the larger river, that is, statistics: A river system starts with small little rivers, and that’s where I see various disciplines, like biology, psychology, economy, econometrics, chemistry. Those are the areas where people do quantitative things. Sometimes they invent something for themselves which is also useful for others, and then these techniques which are invented in a substantive area go down the stream to the big river. The big river represents statistics, so to speak. That’s where everything ends up.
In this metaphor, the smaller quantitative areas like psychometrics and econometrics feed into the larger river of statistics. Methods and models that are developed in these substreams flow, when fully developed and validated by the subfield, into the larger river of statistics and become available to a wider audience. In this sense, psychometric methods can freely flow to other fields, which indeed occasionally happens (e.g., when factor analysis became a well-known research method in chemometrics). The interviews show that what psychometrics entails, and what psychometricians should strive for in their research, is not easily defined. Above all, the interviews show that we should conceive of psychometrics as a field with multiple identities.
Theory vs. Application
The pluralist definition continues with the second distinction: the distinction between psychometricians who work on theoretical or foundational issues (such as building psychological theory or proving mathematical theorems) and psychometricians who prefer building applications of these theories. An example of a psychometrician who cares more for fundamental (mostly mathematical) topics than for their applications is Jos ten Berge (Chap. 19), who: did not examine psychological questions other than that I wanted to see how the methods panned out in practice. I worked on factor analysis on intelligence data for instance, to see
how factor analysis worked, how it performed, but not because I’m so interested in intelligence data per se. That’s for other people to worry about.
What was of most interest to Jos ten Berge was how factor analysis panned out, that is, how it performs on data. But the kind of theoretical questions that can be answered using factor analysis and how factor analysis can be used in applications are not among his interests. It is interesting to note that Jos ten Berge is one of the few who are explicitly devoted to the mathematical side of psychometrics and care less about whether their work has any practical use. While several other presidents, among whom James Ramsay (Chap. 2) and Bill Stout (Chap. 13), are also fascinated by mathematical conundrums and share an admiration for the beauty of mathematics, their passion for mathematical psychometrics does not stand in the way of also wanting to contribute to psychometric applications. For Bill Stout, the focus on applications was even one of the reasons he enjoys his appointment at the psychometrics department more than his earlier appointment at the mathematics department: “With mathematics, I’ve always felt this angst because of its disconnect from applications.” The fact that psychometrics combines mathematics with applications made it a more appealing field for Stout than doing pure mathematics. For an outsider like myself, it is quite a challenge to see that some psychometricians in fact work on applications, as it seems almost obvious that they tend to work on rather fundamental issues: all the interviewees in this book engage in very technical work, like solving mathematical equations or implementing new models, and are not often involved in developing new assessment procedures or test items, some of the clearer applications of psychometric research. Moreover, psychometricians in general care very much for the statistical and mathematical rigor of their research. However, on closer inspection, we can see that for many interviewees, much of this technical work is in fact in service of applications.
Many psychometricians have made contributions to applications such as item banks, software packages, and assessment procedures, and many interviewees consider applications a vital component of their work (though this is of course a matter of degree). For example, Hua-Hua Chang’s research (Chap. 21) focuses largely on computer adaptive testing (CAT) which he considers “[…] making a substantial influence on the functioning of society by affecting how people are selected, classified, and diagnosed. One of the newest developments is that CAT can be utilized to facilitate individualized learning.” The impact of a psychometric invention like CAT seems undeniable, and though not all psychometricians are as determined about finding a social application of their work, making some impact—albeit in the background— does seem a strong motivator. Psychometricians then seem to find a combination between doing highly technical and mathematically sound work on the one hand and finding useful applications of their research on the other hand, thereby improving existing measurement procedures or developing new ones. Note that, as Wim van der Linden points out in Chap. 11, working on theoretical issues does not exclude work on applications; in fact, applications work best if they are based on theory:
I know a lot of people who pretend not to be interested in theory because they want to be practical. But it’s actually the opposite. If you really want to be practical, you have to go deep theoretically to know what you’re doing and guarantee success.
In that sense, theory and application are never opposite ends of the same spectrum. The psychometricians in this volume are particularly dedicated to doing statistically and technologically advanced research, and they are certainly not in it for a quick sell. Whatever applications they build, whether it is a software package or assessment procedure, should be embedded in a deeper theoretical or statistical framework; the backbone of these applications should be solid. Moreover, many psychometricians would probably also acknowledge the inverse, namely, that through building solid applications, one automatically also contributes to theory.
Narrow vs. Broad Focus
An interesting matter that is discussed in the interviews is the boundaries of psychometrics. Where does the jurisdiction of psychometrics end, and what lies beyond the borders of this field? Traditionally, it could be said that contemporary psychometrics (and especially the psychometrics we can read in Psychometrika) roughly consists of three traditions: item response theory, structural equation modeling, and multidimensional scaling (MDS). These major traditions were developed in the second half of the twentieth century and have since remained dominant in psychometrics. Unsurprisingly, most interviewees are or have at one point been important representatives of these traditions. For example, Wim van der Linden and Hua-Hua Chang have done much work in item response modeling, Peter Bentler (Chap. 3) and Bengt Muthén (Chap. 6) have done extensive work on structural equation modeling, and Jan de Leeuw (Chap. 5), Jacqueline Meulman (Chap. 14), and Willem Heiser (Chap. 15) are important representatives of the MDS tradition. However, behavioral research has not stood still. In the meantime, behavioral and psychological research has expanded, and there are new techniques on the horizon that could be of use to psychometrics (examples being machine learning, data mining, robot interaction, or even brain measurements). Many of the interviewees have also engaged in research that does not fall under this narrow definition of psychometric research. For example, Ulf Böckenholt (Chap. 16) has done work in marketing research, Paul Holland in social networks, and Jacqueline Meulman has done extensive research in the field of biostatistics. The following questions then arise: What falls under the category of psychometrics and why? And what kind of definition of psychometrics do the presidents uphold? Simply put, whether psychometrics should or should not broaden its scope to incorporate new techniques and ideas is a point of contention among the interviewees.
Several presidents emphasize the importance of testing remaining psychometrics’ core business, like David Thissen (Chap. 12): “[…] testing will continue to develop
and continue to be a thing that is done for placement in education, in jobs. […] I think testing still has some decades, if not centuries in it.” Given that psychometrics still has so much to offer, some interviewees consider these new developments and thus the broadening of the field a danger for the quality and specific expertise of the field. Ivo Molenaar (Chap. 9) believed that it is not always sensible to “collect that many data because it is only going to cause you problems,” referring to the danger of overfitting and the lack of critical thinking in a mostly computer-driven process. Jacqueline Meulman considers it important to be warned against new trends, like big data. Even though big data sound very promising, they are often very noisy and not representative of the population, and the analysis of big data also does not conform with the expertise of psychometricians or statisticians: What we are good at is sampling, we know sampling theory, and we know that it is wise to take samples out of very big data, so we can do much more careful analyses, getting lots of benefits like variation diagnostics at the same time. At least, I adhere to the vision that we in statistics should not go with these people who are implying, ‘my data are bigger than your data!’.
Jos ten Berge also emphasizes that psychometricians should realize they have a unique expertise which is still in demand: The psychometricians need to have their own research topics, evaluating the methods that we have, and that will go on forever. But strictly speaking I don’t see much future for developing completely new methods because there’s so much around that needs calibrating, validating, improving.
Assuming that testing will be around for a long time, psychometricians will always have important work to do, namely, improving and evaluating the methods that we have, which is, given that psychometric measurements are so plentiful in our society, already a substantial task. Therefore, according to Meulman and Ten Berge, engaging in newer methods should not have priority over perfecting the existing ones. On the other hand, several interviewees see the upside and the potential of including these developments in contemporary psychometric research and criticize the narrow definition of psychometrics. For example, several presidents, among whom Bengt Muthén and Larry Hubert, mention that Psychometrika is perhaps too narrow in terms of content. Ulf Böckenholt sees the potential of analyzing big data, especially since these data can now come from so many different sources, which can help identify “what’s going on in a person’s mind.” In fact, according to Böckenholt, the age of self-quantification that is now upon us should be the dream of the psychometrician, as it can offer so much more knowledge about human behavior than testing does. The psychometrician in that sense should be interested in the quantification of human behavior in a broader sense than what can be achieved through traditional psychometric measurement. As Paul Holland puts it: The future of psychometrics is about the open-mindedness of all the different varieties of the ways that people collect data and try to draw conclusions, and to make sense of it. But this underlying theme, which I think goes back to Spearman, this notion that there’s something there, you try to measure it, but you measure it kind of poorly, or with uncertainty, that stays.
PR Problems
The interviews provided a brilliant opportunity not only to look at modern practices in psychometrics but also to shed light on future developments. Related to the Narrow vs. Broad section is the question whether psychometrics should expand its scope in the coming years or stick to what it knows best, namely, psychological measurement (in all its shapes and forms). I expect that preserving and improving already existing psychometric methods is something all interviewees consider important for psychometric research, but several psychometricians also see opportunities for psychometrics to go beyond the traditional boundaries of psychometric research. In fact, as we have seen, some interviewees (like Paul Holland and Ulf Böckenholt) think that the only way forward is engaging in a wider range of methods and research topics. There are several conceivable reasons why broadening the concept of psychometrics to all these other methods would be relevant: it provides new areas of research, it is fun and interesting, and it is important to keep up with modern trends. What is remarkable is that many interviews point out a completely different reason, namely, that psychometrics currently suffers from a PR problem: psychometrics has trouble reaching out to psychologists and to statisticians. Psychometric expertise is simply difficult to sell to people outside of its disciplinary boundaries: to the psychologists, psychometric methods are highly technical, too complex, and quite detached from substantive theory, and to the statisticians, psychometrics is merely a small subarea of statistics and therefore perhaps easily overlooked. According to Wim van der Linden (Chap. 11), the inability of psychometrics to sell itself can be considered psychometrics’ biggest pitfall. According to Van der Linden, if psychometrics had engaged more in making good user-friendly software, for example, psychometrics might not have had so many issues connecting to applied researchers.
However, the PR problems of psychometrics seem to be more far-reaching than that. Not only is the work of psychometricians complex and therefore difficult to sell to more applied researchers (and psychometricians might not have made enough efforts to do so), but psychometric research may also spark some skepticism when encountered by scholars from other disciplines, especially the more technical fields. Robert Mislevy (Chap. 8) describes his collaborations with others as follows: “it is easier to get people to recognize the value and the use of psychometric techniques if you do not call them psychometric techniques until you have worked with them for a couple of months at least!” So, there may be something about psychometrics that does not immediately speak to people from other disciplines. Bengt Muthén figures this might be because psychometrics (wrongfully) is not always taken seriously. Whether psychometric knowledge is taken seriously also depends on how psychometric knowledge is framed: Statistics is not that fond of factor analysis and structural equation modeling; statisticians think of that as hocus pocus machinations. But if you present it as latent variable dimension reduction thinking, then it’s similar to what the statisticians write about in biometrics for instance.
Even though many psychometricians engage in rigorous, statistically sound psychometric research, the psycho-prefix might cause some hesitation on the part of statisticians. Our focus on latent variables, that is, unobservable quantities that might or might not exist, does not directly appeal to statisticians, who usually lack such a substantive focus and whose work is mostly data-oriented. The presidents mention several solutions to psychometrics’ PR problems, which are aligned with their narrow or broad conception of psychometrics. Some emphasize that we should not miss the boat now: psychometrics has been too conservative in the past and should now widely engage with new methods so that the unique expertise of psychometricians can be put to good use. Others remain more conservative and stress that psychometricians should stick to what they know best, which is psychological measurement. There is something to be said for the latter strategy: testing and measurement still seem to be in high demand, and there is still plenty of work to be done in that area. Interestingly, both sides of the argument fear that psychometrics might remain too isolated and out of touch with the scientific playground, showing that this is a big concern for the field.
Conclusion
This book provides a unique insight into the perspectives of psychometricians on their own field. The chapters show that psychometrics is still very much alive and is made up of a wide range of thoughts and ideas, not only on psychological measurement but also on what lies beyond the borders of psychometrics. What the psychometricians share is a passion for the field, but the reasons for their passion certainly vary from president to president: where one president is more mathematically inclined, others use psychometrics to support psychological science, and where some prefer to reach out to new and modern techniques that widen the scope of psychometrics, others prefer to stay close to the more traditional boundaries of psychometrics. Though all of them share the vision that psychometrics remains highly relevant, their concern for the field is also crystal clear: psychometrics should not be left in the dust, and the psychometric community should seek out avenues to prove its relevance. Perhaps this book can prove to be a small contribution towards that goal. My hope is that this book will contribute to the visibility of the field and especially to preserving the testimonies of some of its main protagonists.