Statistics for Applied Behavior Analysis Practitioners and Researchers
Critical Specialties in Treating Autism and Other Behavioral Challenges
Series Editor: Jonathan Tarbox
Statistics for Applied Behavior Analysis Practitioners and Researchers

David J. Cox
Behavioral Data Science Research Lab, Endicott College, Beverly, MA, United States
RethinkFirst, New York, NY, United States

Jason C. Vladescu
Department of Applied Behavior Analysis, Caldwell University, Caldwell, NJ, United States
The Capstone Center, Caldwell, NJ, United States
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-323-99885-7

For Information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Nikki Levy
Acquisitions Editor: Joslyn T. Chaiprasert-Paguio
Editorial Project Manager: Barbara Makinster
Production Project Manager: Punithavathy Govindaradjane
Cover Designer: Mark Rogers

Typeset by MPS Limited, Chennai, India
CONTENTS
About the series editor  ix
About the authors  xi
Series Foreword: Critical Specialties in Treating Autism and Other Behavioral Challenges  xiii
Preface  xv

Chapter 1 The requisite boring stuff part I: Defining a statistic and the benefit of numbers  1
  Introduction  1
  Defining a statistic  3
  The benefits of numbers  4
  Models and model building  6
  Common myths and misconceptions about statistics  9
  Statistics in applied behavior analysis  12
  Chapter summary  17
  References  18

Chapter 2 The requisite boring stuff part II: Data types and data distributions  21
  Introduction  21
  Data types  24
  Data distributions  34
  Quick recap and resituating ourselves  44
  References  47
  Supplemental: Probability distribution equations  48

Chapter 3 How can we describe our data with numbers? Central tendency and point estimates  51
  Introduction  51
  What is next on the agenda?  52
  High-level overview of the why and the how  53
  Common descriptions of central tendency  55
  Examples of how reporting central tendency in applied behavior analysis can be fun  62
  Percentage  66
  Chapter summary  70
  References  72

Chapter 4 Just how stable is responding? Estimating variability  75
  Introduction  75
  Describing the spread of your data  76
  Describing how well you know your measure of central tendency  82
  Other flavors for describing variability in your data  89
  Choosing and using measures of variance in applied behavior analysis  94
  References  96
  Supplemental: Why square the difference, and square root the final measure?  96

Chapter 5 Just how good is my intervention? Statistical significance, effect sizes, and social significance  99
  Introduction  99
  Statistical significance  102
  Effect sizes  112
  Social significance  126
  Chapter summary  129
  References  130

Chapter 6 Oh, shoot! I forgot about that! Estimating the influence of uncontrolled variables  135
  Introduction  135
  Situating this chapter in the broader analytic landscape  136
  Models of behavior  139
  Brief primer on interpreting models  154
  Chapter summary  171
  References  172

Chapter 7 How fast can I get to an answer? Sample size, power, and observing behavior  175
  Introduction  175
  Enough is enough  176
  Is the intervention effect I’m seeing real?  189
  How do I know if I’ve accounted for the variables multiply controlling behavior?  191
  When to decide when to stop: Variations on a theme  193
  Chapter summary  194
  References  195

Chapter 8 Wait, you mean the clock is always ticking? The unique challenges time adds to statistically analyzing time series data  199
  Introduction  199
  Statistical analysis of time series data for single-case designs  203
  Structured criteria  203
  (Non)Overlap statistics  208
  Effect size measures  211
  Regression and classification modeling  213
  Nested approaches to modeling  216
  Chapter summary  219
  References  222

Chapter 9 This math and time thing is cool! Time series decomposition and forecasting behavior  225
  Introduction  225
  Time series analyses through a different lens  227
  Time series decomposition  229
  Forecasting behavior  239
  Chapter summary  247
  References  247

Chapter 10 I suppose I should tell someone about the fun I’ve had: Chapter checklists for thinking, writing, and presenting statistics  251
  Introduction  251
  Checklist and questions to answer when writing/presenting about statistics  253
  References  257

Chapter 11 Through the looking glass: Probability theory, frequentist statistics, and Bayesian statistics  259
  Introduction  259
  It’s assumptions, all the way down  261
  Probability theory  264
  Frequentist approach  269
  Bayesian approach  271
  Chapter summary  275
  Closing thoughts  276
  References  277

Index  279
ABOUT THE SERIES EDITOR
Jonathan Tarbox (he/they), PhD, BCBA-D, is a professor and mindfulness researcher at the University of Southern California, as well as Director of Research at FirstSteps for Kids. His life's work involves research and practice in areas that help people thrive during times of stress and discomfort. Dr. Tarbox has published five books in psychology and over 90 scientific articles and chapters, and has served as Editor-in-Chief of the scientific journal Behavior Analysis in Practice. His practical work revolves around supporting children and families, as well as teaching adults skills that help us connect with deeper meaning and purpose in the context of life's struggles. Compassion and social justice are the compass that guides Dr. Tarbox's work. Dr. Tarbox is proud to have multiple neurodivergent family members and is working hard to become a more effective ally to the Autistic community.
ABOUT THE AUTHORS
David J. Cox, PhD, MSB, BCBA-D, is currently working as the VP of data science at RethinkFirst and is faculty at the Institute for Applied Behavioral Science at Endicott College. Dr. Cox earned an MS in bioethics from Union Graduate College and a PhD in behavior analysis from the University of Florida, and completed postdoctoral fellowships at the Behavioral Pharmacology Research Unit of Johns Hopkins University School of Medicine and at Insight! Data Science. Since 2014, Dr. Cox's research and applied work has focused on how to effectively leverage technology, quantitative modeling, and artificial intelligence to ethically optimize behavioral health outcomes and clinical decision-making. Based on his individual and collaborative work, he has published more than 50 peer-reviewed articles and three books and has delivered more than 150 presentations at scientific conferences.

Jason C. Vladescu, PhD, BCBA-D, NSCP, LBA (NY), is a founding partner at The Capstone Center and professor in the Department of Applied Behavior Analysis at Caldwell University. Jason completed his predoctoral internship and postdoctoral fellowship at the University of Nebraska Medical Center's Munroe-Meyer Institute. He has published more than 80 peer-reviewed articles and book chapters spanning his research interests in early behavioral intervention for children with autism spectrum and related disorders, increasing the efficiency of academic instruction, staff and caregiver training, equivalence-class formation, and mainstream applications of behavior analysis. Jason is on the Science Board of the Association for Behavior Analysis International, the president-elect of the New Jersey Association for Behavior Analysis, a member of the Autism Advisory Panel for the New Jersey Department of Education, and a current or former associate editor for Behavior Analysis in Practice and the Journal of Applied Behavior Analysis. He currently or previously served on the editorial boards of Behavior Analysis: Research and Practice, Behavioral Interventions, The Analysis of Verbal Behavior, The Psychological Record, School Psychology, Behavioral Development, and the Journal of Applied Behavior Analysis. He was the 2020 recipient of the APA (Division 25) New Applied Researcher Award.
SERIES FOREWORD
Critical Specialties in Treating Autism and Other Behavioral Challenges

Purpose

The purpose of this series is to provide treatment manuals that address topics of high importance to practitioners working with individuals with autism spectrum disorders (ASD) and other behavioral challenges. This series offers targeted books that focus on particular clinical issues that have not been sufficiently covered in recent books and manuals. This series includes books that directly address clinical specialties that are simultaneously high prevalence (i.e., every practitioner faces these problems at some point) and yet are also commonly known to be a major challenge, for which most clinicians do not possess sufficient specialized training. The authors of individual books in this series are top-tier experts in their respective specialties. The books in this series will help solve the challenges that practitioners face by taking the very best in practical knowledge from the leading experts in each specialty and making it readily available in a consumable, practical format. The overall goal of this series is to provide useful information that clinicians can immediately put into practice.

The primary audience for this series is professionals who provide support and education to the Autistic community. These professionals include Board Certified Behavior Analysts (BCBAs), Speech and Language Pathologists (SLPs), Licensed Marriage and Family Therapists (LMFTs), school psychologists, and special education teachers. Although not the primary audience for this series, parents and other caregivers of Autistic people will find the practical information contained in this series highly useful.

Series Editor
Jonathan Tarbox, PhD, BCBA-D
University of Southern California and FirstSteps for Kids, CA, United States
PREFACE
Wherever there is number, there is beauty.
Proclus
Scientists love data. To make sense of data, you need numbers. And, to make sense of many numbers, you need mathematics. Practical mathematics do not need to be complicated. Simple mathematical functions help us balance our bank accounts and predict the cost of our dinner. But the world is often a complicated place and, sometimes, more complicated mathematical functions are better at helping us describe the universe around us. This book is about one area of mathematics—statistics—and how it can be practically used to describe the wondrous relations between behavior and the environment. Before you dive into the content and immerse yourself in these cool waters, it’s important to understand what this book is, what it is not, and the reasons we chose to write it in the first place.
What this book is (and isn’t)

This book is an introduction to statistics for behavior analysts. Behavior analysts are scientists and, assuming the first paragraph is true, they also love data. We use data for just about every decision we make—from understanding behavior contextually, to answering whether our intervention is effective, to seeking to convince others to continue funding our services or research. Data are the lifeblood of our work.

Despite our love of data, we don’t always use them to their fullest capacity. Like simple addition and subtraction in finance, graphical displays depicting single-case experimental designs are the analytic workhorse of modern behavior analysis. They get the job done well enough, they are historically reliable, and they allow us to discuss our work easily with other behavior analysts. But you can likely do more with your data than you realize. There are patterns of behavior-environment relations waiting to be uncovered, knowledge about the universe yet to be described, and ways of helping people that can be illuminated if we take the time to play creatively with our data.
Statistics are one way to do this. And, by creating single-case experimental design graphs, you are already behaving statistically even if you didn’t know it! Yet, it’s possible to do statistics illogically. Statistics are verbal behavior constrained by the field of mathematics, which is further constrained by the field of logic. Thus, at its core, there are rules people must follow when doing statistics for its underlying logical system to be, well, logically coherent. This book attempts to outline how some of this logic can be applied to the work behavior analysts do. Specifically, the book is organized sequentially around the decisions behavior analysts make when collecting data, turning their observations about behavior and the environment into numbers, aggregating them within and across clinical or experimental sessions, and then making sense of the behavior-environment relations they examine. Each of these steps involves numbers, statistics, rules around how to do it logically, and—perhaps most excitingly—additional ways that behavior analysts might creatively squeeze more juice out of their data.

As an introductory text, the primary goal is to introduce behavior analysts to the role of statistics in the decisions they make every day. This book is also designed for those who are new to the language of statistics or who pushed their statistical knowledge to some long-forgotten corner of their behavioral repertoire. We tried to cover the material as simply as we could and to focus on helping behavior analysts merge the way they think about the world with the language of statistics. As a result, this book does not get into advanced statistical modeling, statistical simulations, optimization procedures, machine learning, or any of the other hot statistical topics of our day. If those are the topics you are looking to dance with, this is not the book for you.

Similarly, a note of framing for the statisticians who felt the need to pick up this book for some odd, strange reason. The focus of this book is how statistics can be used as a tool by behavior analysts. Thus, we describe how behavior analysts can wrap statistics around what they do (i.e., statistics play a supporting role). We purposely did not try to use statistical theory to change what behavior analysts do. We feel statistics are merely one instrument in the scientist’s toolbelt, rather than their guiding light; and we wrote the book accordingly.

Lastly, we would be remiss not to comment on the tone of the book. Mathematics, statistics, and logic can be intimidating and tough sledding for the initiated and uninitiated alike. As Stephen Hawking is claimed to have stated, “Someone told me that each equation I included in the book would halve the sales.” If true, then any book on statistics is likely to have only a handful of readers (often the authors’ parents and significant others). To turn this drudgery to—gasp!—fun, and (we hope) increase readership, we attempted to employ a more informal, engaging, and—in some instances, dare we say—humorous tone. As academics, we know this is sometimes looked down upon. But we’re more interested in function over form. And it’ll be worth it if we can get even a small portion of behavior analysts feeling comfortable using the beauty of mathematics to creatively explore their data.
Onward, Ho!

Regardless of where you are in your comfort with mathematics and statistics, we sincerely hope you find this book useful in your research and practice. We’re also really excited to see the unique, creative, exciting, and thought-provoking patterns you uncover about behavior-environment relations as you gain fluency in using statistics to play with your data. Without further ado, let’s get to the good stuff.

David J. Cox
Jason C. Vladescu
CHAPTER 1

The requisite boring stuff part I: Defining a statistic and the benefit of numbers

In God we trust. All others must bring data.
Robert Hayden
Introduction

People seem to have one of two reactions to the word “math.” In both situations, the pupils dilate and the pulse quickens. For some, this happens from dread; and, for others, from pure delight. Regardless of how you’ve found this book, we hope by the end you’ll always fall in the latter camp. After all, math and statistics are just verbal behavior, so there really isn’t anything to be afraid of once you learn the language. We’re excited you’re here to learn about what math and statistics have to offer as one method for describing environment-behavior relationships.

We intentionally sought to craft a text that is different from previous statistics books you may have encountered. To begin, we wrote this book for applied behavior analytic practitioners and researchers.¹ The field of applied behavior analysis (ABA) is newer to the inclusion of certain statistical practices in the published literature. Many traditional statistics books begin with statistical theory and attempt to sprinkle in relevant examples along the way. Rather than approaching each topic through the lens of statistical theory, we attempt to approach each topic through a practical lens by asking: Why conduct these analyses in the first place? Under what conditions might this way of thinking about our data be useful? And, under what conditions might this way of thinking about our data add little?
¹ Hereafter we’ll use the term behavior analyst to broadly refer to individuals functioning as clinicians, researchers, or both.
This book is also different in that we center it around practical decisions that behavior analysts make in their work. Sometimes, statistics books try to force research methodologies and clinical questions into a sequence of statistical theory. For example, the book may start with an introduction to probability theory, use the classic “black and white balls pulled from an urn” example, and then try to make a parallel with examples in healthcare or business that fit into a similar framework. Anyone who has experienced the disgruntled statistician, lamenting the fact you didn’t come to them first and lecturing you about how you should design your experiment around the eventual statistical tests, is likely familiar with this way of thinking: use statistical theory to guide what you do. But that’s rarely how real-world decisions are made. No knock to any strategic planning dynamites out there but, in our experience, many applied decisions arise in the absence of a neat, tightly controlled experimental arrangement. Often, we have a question about a behavior-environment relationship or some challenge we are trying to solve; we have some set of related data that varies in its usefulness; and we need to come to an informed conclusion using those data that allows us to make the best decision we can. So, throughout this book, we attempt to approach each topic framed around the questions with which the topic is likely to coincide.

Lastly, behavior analysts primarily collect and use a very specific type of dataset (i.e., within-subject time series datasets). As you’ll see in the following chapters, these types of datasets have unique characteristics that violate some of the assumptions we might often make about behavioral processes. Once pointed out, a host of interesting questions arise for behavior analysts that might be hard to shake. Heavy use of within-subject time series datasets is somewhat unique among health professionals² and so we dive into it out of pure necessity and to raise some questions the field of ABA may want to (re)consider. But, before we can get there, we’ll follow the advice of the King when the White Rabbit asks him where to begin: “Begin at the beginning and go on till you come to the end.” So to the beginning we go.
² A caveat is in order—unique for clinical professions, perhaps, where randomized controlled trials are often considered the gold standard. But many other professions use time series data (e.g., finance, marketing) and have been developing sophisticated methods for analyzing these types of data quantitatively. We’ll chat later in the book about how behavior analysts might use some of these techniques to supplement visual analysis.
Defining a statistic

Statistics have a historically bad reputation in behavior analysis. For example, Skinner (1938) notoriously eschewed statistical analysis of behavior-environment relations in favor of experimental approaches. The basic idea was that few, if any, things in the universe remain constant and unchanged. As behavior analysts, we often observe variability in behavior across time, contexts, and people. As such, Skinner argued scientists should use experimentation to understand the functional determinants of variability rather than mathematically controlling for variability through statistical techniques (Skinner, 1956). That early skepticism of statistical analysis has carried through the field to the present. The resulting dogma around “statistics” being a dirty word has, thus, influenced many behavior analysts and the training they receive during their education and practicum experiences. As a result, behavior analysts who use the word “statistic” within their published manuscripts or in conversation with colleagues can sometimes be treated as less than pure. But, as noted by many over the last several decades (e.g., Branch, 2014; Ioannidis, 2005; Young, 2018), the logical challenge to which Skinner spoke occurs with a specific set of procedures called null hypothesis significance testing (NHST). We will talk in depth about this later on. What’s important to note here is that NHST comprises one small branch of the large field that is statistics. Ironically, even professional statisticians have decried its overuse in many areas of science (e.g., Wasserstein & Lazar, 2016).

So what is “statistics”? Statistics is a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data (Merriam-Webster, 2021). Hey—wait a second! Behavior analysts collect, analyze, interpret, and present aggregate quantitative data all the time. Does this mean that behavior analysts regularly use statistics? Yes, they do. In fact, behavior analysts use statistics so frequently that we wrote a book about it to provide guidance for those with less formal training (and because it’s fun to talk about).

Just about every graph used in practice or displayed in published behavior analytic journals and books includes statistics. For example, the percentage of responses that are correct is a single number that describes the aggregate of correct responses divided by the aggregate of opportunities available to respond. As a second example, response rate does not provide you with information about any single response that occurred during a session. Rather, response rate aggregates all the occurrences of a specific response topography, or set of topographies, into a single number where we control for time. Even the cumulative record shows the aggregated count of responding as a function of time. Percentage of responses that are correct, response rates, and the cumulative number of responses within some time period are—by definition—statistics.
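To make this concrete, here is a minimal sketch (ours, not part of the original text; all data and names are hypothetical) of how these three statistics could be computed from one session's raw records:

```python
# Hypothetical raw records from a single session.
# Each trial is True (correct response) or False (incorrect response).
trials = [True, True, False, True, False, True, True, True, False, True]

# Timestamps (seconds) of each target response in a 300-second session.
response_times = [12.4, 31.0, 55.2, 80.9, 121.7, 160.3, 201.5, 244.8]
session_seconds = 300

# Statistic 1: percentage of opportunities with a correct response.
percent_correct = 100 * sum(trials) / len(trials)

# Statistic 2: response rate, controlling for time (responses per minute).
rate_per_minute = len(response_times) / (session_seconds / 60)

# Statistic 3: cumulative count of responses over time (a cumulative record).
cumulative_counts = list(range(1, len(response_times) + 1))

print(f"Percent correct: {percent_correct:.0f}%")      # 70%
print(f"Responses per minute: {rate_per_minute:.1f}")  # 1.6
print(list(zip(response_times, cumulative_counts)))    # points on a cumulative record
```

Each printed value is an aggregate: many observations collapsed into one or a few numbers.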
The benefits of numbers

So why use statistics? It turns out that the everyday languages used by humans are comparatively imprecise. For example, consider the daily difficulty you would encounter if numbers were not used to describe the weather outside, the speed with which your car travels, or the time of day. Life would likely look very different than it does now. The weatherperson’s description of temperature would likely be relative to their experience and may not help us dress comfortably. Safely slowing down for blind curves and for pedestrians in school zones might be difficult without numbers to guide the ideal speed. And, scheduling meetings would be very tricky for those who work with people spanning time zones or when weather conditions make the position of the Sun ambiguous. In each instance, statistics provide a precise and useful aggregate description of environmental events through the use of numbers.

Conveying more information, more precisely, and with fewer words is particularly useful to scientists (e.g., Dallery & Soto, 2013). Modern science leverages a theoretically simple and intuitive set of procedures. Identify something in the universe (independent variable [IV]) that you think influences something else (dependent variable [DV]). Systematically present, remove, or vary that IV while trying to hold everything else constant. Then, use tools of some kind to observe and measure the degree to which the IV is present, the degree to which the DV is present, and how well you’ve held everything else constant. Numbers and statistics give us the means to capture these observations in a few lines, paragraphs, or pages rather than requiring our clinical or lab notebook to explode into hundreds or thousands of pages. And, because the verbal stimuli referred to as numbers have a similar or identical basic function for people the world over, it is very easy to communicate those observations accurately and precisely.

Behavior analysts benefit from using numbers and statistics, too. (After all we, too, are scientists.) Behavior analysts’ subject matter expertise is the behavior of the individual, and we analyze how patterns of changes in the environment correspond with changes in behavior. By conducting analyses using numbers, we can describe, predict, and control the behavior of an individual (e.g., Catania, 1960; Cooper et al., 2020)—quite the challenging task. Behavior, by definition, is always occurring and always changing. The environment includes dozens, hundreds, or maybe thousands of environmental events that also are always occurring and always changing. Numbers and statistics help behavior analysts more succinctly describe and communicate the occurrence and degree of behavior and environmental stimuli, how behavior-environment occurrences covary, and what behavior we predict we will observe if we change the environment in a particular way.

The following brief thought exercise may help demonstrate how the benefits of numbers and statistics extend to the professional practice of behavior analysis. To start, think of an intervention you recently implemented with a client or an experimental preparation you recently conducted. On the lines below, describe the effects of that intervention or your experimental preparation using only descriptions of moment-by-moment occurrences (or nonoccurrences) of the behavior, the programmed (or unprogrammed) antecedent and consequent stimuli, and without using numbers:
Compare what you wrote with the following: “Aggression decreased from ten responses per minute to one response per month by the end of the intervention.” We obviously have no clue what you wrote down. But we’re (err, well, at least David is) willing to wager a drink at the next behavior analytic conference that the 18-word phrase we used is more precise and uses fewer words. Statistics are the culprit. Statistics allow us to easily aggregate information from many observations while maintaining some degree of precision of those observations. In short, statistics typically allow us to convey more information using fewer words.
Models and model building

Another important idea to science generally, and the use of statistics specifically, is the notion of models. A model can be defined as a miniature representation of something (Merriam-Webster, 2022). Said differently, models describe relationships between variables. Similar to statistics, we use models regularly even though we may not use the word “model” in our everyday behavior analytic vernacular. For example, the fundamental unit of analysis in behavior analysis, the three-term contingency, is a model (Moxley, 1996; Skinner, 1938, pp. 178–179). Models help researchers specify broader patterns in our observations so that they can be tested more directly and improved upon. The three-term contingency is one way that behavior analysts describe the relationships between events that occur before a behavior (antecedents), the behavior that we’re particularly interested in, and the changes in the environment that follow a behavior (consequences). But it’s important to remember that the three-term contingency is not a physical thing with a physical existence in the universe. The three-term contingency is a miniature verbal representation of the totality of the environment surrounding behavior unfolding in time. Stated differently and more eloquently by the physicist Richard Feynman, “If our small minds, for some convenience, divide this...universe into parts...remember that nature does not know it!” (Feynman, 1965, para. 34). Other examples of models in behavior analysis include the four-term contingency (e.g., Michael, 1982), the discounting equation (e.g., Mazur, 1986; Rachlin, 2006), and the generalized matching equation (e.g., Baum, 1974; Baum & Rachlin, 1969).

So why build models? The short story seems to be about efficiency and utility. Rarely are scientists interested in simply describing the things they observe. Often, scientists want to know why something happens. As many behavior analysts are likely familiar with, understanding why something occurs requires repeated observation and experimentation with the variables relevant to our phenomenon of interest. But the universe is large and vast and we are unable to observe and measure everything. Models—built up over time through experimentation and scientific communication—help the model user focus their efforts on measuring and manipulating only the IVs that actually matter (efficiency). And, these miniature verbal descriptions of the universe only survive to the extent that they effectively help the model user understand, predict, and control the phenomenon they’re interested in (utility).

Models can be described in various categorical ways. Fig. 1.1 shows these different ways to think about models and how different models from the published behavior analytic literature fit this schematic.

Figure 1.1 Types of models in behavior science.

One way to categorize models is based on the function of the modeler’s behavior (x-axis of Fig. 1.1). Here, we can use the commonly espoused continuum of scientific understanding: description, prediction, and control (e.g., Cooper et al., 2020; Skinner, 1938). For behavior analysts, descriptive models serve the function of describing a pattern of behavior-environment relationships as precisely as possible. As we move away from descriptive models, we start to get toward models that provide causal explanations of behavior-environment relationships. Here, we start to play with models that allow us to predict whether behavior will occur and to what degree (i.e., predictive models). And, arguably, the most useful causal models are those that allow us to prescribe courses of action for the model user (i.e., prescriptive or decision models).

A second way to categorize models is based on the topography of the modeler’s behavior (y-axis of Fig. 1.1). Here is where the most variability might be observed and the most opportunity for future creativity might lie. For example, we can simply use written or spoken words to describe a likely causal relationship between environmental variables and behavior (e.g., the three-term contingency). We might visually or graphically display information to describe the relationship between the environment and behavior (e.g., a reversal design plot contrasting baseline with intervention). We may use mathematical expressions to describe the relationship between the environment and behavior (e.g., the matching law; Baum, 1974; McDowell, 2005). Or, as a final example, we might even use metaphorical language to highlight the relationship between the environment and behavior (e.g., response strength; Skinner, 1938, 1953, 1974). Note here that the topography is not relevant per se. All of these are instances of verbal behavior (Cox, 2019; Marr, 2015). And, as we know about verbal behavior (e.g., Skinner, 1957), topography is less important than the function that response serves for the speaker and listener (e.g., describe, predict, control).

The role of statistics becomes clearer when we combine the function and topography of the modeler’s behavior. Generally speaking, the use of numbers and quantitative relations between the environment and behavior allows for greater precision in our descriptions, predictions, and control of behavior (e.g., Dallery & Soto, 2013). Statistics are one category of verbal behavior that behavior analysts use to precisely describe and predict interactions between the environment and behavior. If this all still sounds a bit vague, no worries—later in this chapter we provide examples of statistics in the wild with which behavior analysts are likely familiar. But, before we get there, it’ll likely help to talk about what statistics are not.
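To give one concrete instance of a mathematical model from the list above, the generalized matching equation (Baum, 1974) is commonly written in logarithmic form as

$$\log\left(\frac{B_1}{B_2}\right) = a \log\left(\frac{R_1}{R_2}\right) + \log b$$

where $B_1$ and $B_2$ are rates of responding on two alternatives, $R_1$ and $R_2$ are the rates of reinforcement those alternatives produce, $a$ describes sensitivity of behavior to relative reinforcement, and $\log b$ describes bias toward one alternative. Written this way, the model is a compact verbal description: two fitted numbers summarize an entire pattern of behavior-environment relations.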
Common myths and misconceptions about statistics

At this point in the chapter, we think it is useful to pick our heads up out of the book and wave “hello” to the elephant in the room. As described previously, statistics have had (and in some circles likely continue to have) a notorious reputation in behavior analysis, often associated with “the other” researchers who were not behavior analysts. Thus, likely through basic respondent conditioning procedures, several superstitious pairings may have occurred and now exist formally as a conflation for some. Two of the most common center around how scientists go about conducting research: group design vs. within-subject design studies, and inductive vs. deductive reasoning.
Myth #1: Statistics = group design research

Perhaps one of the most common conflations is equating statistics with group design research. It is true that many researchers have published studies within the biological, behavioral, social, and medical fields where the researchers compare the effects of an IV between two or more groups of participants. It is also true that many of the researchers conducting those studies have used NHST to analyze the data they collected through the study. But, how one chooses to expose participants to an IV is different from how one handles and analyzes the data on the backend. NHST can be and has been used to analyze within-subject data (see the pages of the Journal of Applied Behavior Analysis and the Journal of the Experimental Analysis of Behavior). It is also true that researchers employing group designs can and have analyzed their data without using NHST. In short, the research design chosen does not, by necessity, require the researchers to use NHST. That is simply one of many ways to describe one’s data. And, as noted earlier, statistics involve many different ways of aggregating, describing, analyzing, and talking about data.

Researchers employing group design and within-subject research also use statistics other than NHST. As described previously, the percentage of responses correct, responses per minute, average latency to respond, and the average allocation of behavior among alternatives across many sessions are all statistics that have been successfully employed by behavior analysts. Such descriptive statistics are also very common in research employing group designs. For example, tables that provide counts and average demographic information about the participants are describing data in aggregate—descriptive statistics. And, the new era of “Big Data” is ripe with large group comparisons that forego traditional NHST because, at sample sizes in the tens of thousands and millions, almost everything is statistically significant. As a result, the field of data science, which leverages “Big Data,” places greater emphasis on socially significant differences and visual analysis (e.g., Simpao et al., 2015; Steed et al., 2013).

To summarize this myth: researchers that use group designs and within-subject designs have used both descriptive and inferential statistics. The decision around how one aggregates, describes, analyzes, and communicates about the data they have collected is determined by the question being asked, the possibilities and limitations of the data on hand, and the audience with whom one is communicating. In short, statistics do not equate to group design research.
Myth #2: Statistics = deductive reasoning = bad

Historically, authors within behavior analysis have made significant distinctions between approaching science and one’s data using deductive reasoning compared to using inductive reasoning. Deductive reasoning within philosophy of science, and as related to research, is often described as instances wherein a researcher follows three steps in sequence. First, they describe theoretically how manipulating an IV should influence a DV based on what past research suggests about the IV and DV of interest. Second, the theoretical account guides the researcher to emit a hypothesis about something not yet known but that logically follows from the theory (i.e., the research question). Finally, a study is designed and conducted to directly test that hypothesis. Differently stated, deductive reasoning might refer to conclusions being drawn from a logical chain of reasoning in which each step in the sequence follows necessarily from the previous step (Ennis, 1969).

Inductive reasoning within philosophy of science, and as related to research, is often described as instances wherein a researcher also follows three steps in sequence. First, they are curious about the effect of an IV on a DV. Second, the researcher designs an experiment to systematically control for the degree of presence or the absence of that IV while measuring the DV. Finally, they use the resulting data along with past research to understand theoretically how everything fits together. Differently stated, inductive reasoning might refer to drawing a generalized conclusion from particular instances of observation (Mish, 1991).
The previous two paragraphs are more similar than they are different. In both situations, the researcher or practitioner is interested in and measures the direct effect of an IV on a DV; their interest in that IV is informed by past research and published information about the topic; if the likely influence of the IV on the DV didn’t logically follow published research, it would be unlikely to get through an Institutional Review Board or be considered good practice; and the particular instances of data that are gathered are used to draw generalized conclusions about how the world works. When described in this way, it is easier to see that framing research and data analysis as inductive vs. deductive reasoning is often a false dichotomy (i.e., a situation where someone makes a claim that only one of two [or several] options must be true and the rest false). Here, the logically fallacious claim is that a researcher or practitioner can engage in only deductive reasoning or they can engage in only inductive reasoning—but not both.

This is logically false for at least two reasons. The first reason is that behavior is always embedded in a larger context, necessarily follows past behavior, and necessarily influences future behavior. Stated differently, it seems unlikely that research or practice questions are chosen in isolation from all past published research (i.e., chosen irrespective of a theoretical interpretation of the physical universe). And, it seems unlikely that researchers or practitioners would ever fail to learn from the data they have collected following an experiment and update how they think about the world. Thus the claim that people using inductive reasoning avoid theory until after data have been collected is pure madness. And, the notion that individuals using deductive reasoning fail to use their data to derive conclusions about how the world might work (as opposed to simply updating some theory in their head) is also pure madness.

The second reason that framing the debate as inductive vs. deductive reasoning is false is that it ignores that other types of reasoning are possible. For example, some have argued that transformational reasoning is a chain of behaviors different from inductive and deductive reasoning (Simon, 1996). In transformational reasoning, people learn through visualizing how physical events would play out if they were to actually occur. Einstein’s famous thought experiment that led to general relativity theory is considered a famous example of this type of reasoning and, behaviorally, might be captured by some combination of “seeing in the absence of things seen” (Skinner, 1963) and known behavioral processes of generalization (Kirby & Bickel, 1988) and covert problem-solving (Palmer, 2009). As another example, abductive reasoning occurs when the sequence of steps involves probabilities of being likely (as opposed to necessarily following each other) and where data are used to make generalized—but probabilistic—conclusions. Regardless of how one interprets the two examples in this paragraph, the point is that deductive and inductive reasoning are sets of behaviors tacted by humans that, though described in distinct ways, seem unlikely to capture (1) how researchers and practitioners actually go about their work and (2) all possible descriptions of “reasoning” that behavior analysts engage in.

To conclude this myth, we want to note that this entire section thus far has been devoid of mentions of statistics. We discussed how one might use past research and knowledge to come up with a question about the universe. And, we discussed how one might “update” or “improve” their understanding of the universe after running some kind of experiment. Thus, conflating statistics with only a deductive approach to science is likely missing the mark.
“Us vs. them” is a false dichotomy

The best summary of this section is likely to return to the definition of statistics we described previously. Statistics is a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data (Merriam-Webster, 2021). Myths that conflate statistics with group design research and deductive reasoning are just that—myths. Unfortunately, these myths have led to the pairing of the stimulus “statistics” with two other sets of notorious verbal stimuli in “us” vs. “them.” Through basic behavioral processes of stimulus–stimulus pairings, such an “us” vs. “them” misappropriation of the word statistics has likely misled many behavior analysts to behave as though “statistics” are aversive stimuli to be avoided. But, in reality, statistics are not about “us” vs. “them” and how behavior analysts do things differently (often assumed as better) than other scientists. Rather, using statistics is about squeezing the most juice out of your data that you can, based on the question you are trying to answer and the audience with whom you are communicating.
Statistics in applied behavior analysis

In this final section of the chapter, we aim to make statistics feel more familiar than they likely do. Remember, statistics are simply the behaviors and behavioral products involved in the collection, analysis, interpretation, and presentation of aggregate quantitative data. Considering the fundamental role of observation and measurement in ABA (Baer et al., 1968), behavior analysts likely engage in some form of statistical behavior in their daily professional lives, even if they are not fully aware of doing so. Our purpose here is not to provide a treatise on measurement, research tactics, or decision-making models that consider the conditions under which measurement and aggregation practices should be employed—a number of such resources already exist (e.g., Cooper et al., 2020; Johnston et al., 2020; LeBlanc et al., 2016; Sidman, 1960). Rather, we highlight examples of statistical practices commonly employed by behavior analysts to illustrate that yes, as a behavior analyst, you’re already in the game!

Behavior analysts are most often interested in capturing responses emitted by an individual across time. At the most basic level, doing so involves no aggregation at all and, thus, won’t meet the definition of statistics. However, we decided to mention it here because of the substantial role response-by-response analysis has played in the history of behavior analysis and to provide a context for comparing other practices behavior analysts employ that are statistical in nature. With response-by-response analysis, each response is depicted as it occurs in time. Note, however, as soon as a response-by-response analysis displays the cumulative frequency, it meets the definition of a statistic. For example, Fig. 1.2 depicts a cumulative frequency graph containing hypothetical data from a reinforcer assessment.

Figure 1.2 Example of a cumulative frequency graph.

For an example from the published behavior analytic literature, Pisman and Luczynski (2020) evaluated child preference for three play contexts while their caregivers implemented play-based instruction. The authors recorded each child’s initial-link selection during a concurrent-chains arrangement and depicted each selection using a cumulative frequency graph.

Behavior analysts commonly, perhaps most commonly, aggregate data at the session level. We’d venture to guess the two most common ways behavior analysts aggregate and summarize data are as response rate and as the percentage of opportunities with some response. To illustrate the former, behavior analysts regularly assess problem behavior emitted by their clients through functional analysis procedures. An experimental functional analysis includes collecting the frequency of each behavior as the behavior analyst manipulates environmental variables, aggregating the recorded frequency of behaviors that occurred during each observation as response rate (frequency divided by time), depicting response rate in a line graph, and visually analyzing and interpreting the aggregated data depicted on the graph. Fig. 1.3 provides an example of a graph depicting hypothetical data from an experimental functional analysis (for two examples, among many, of behavior analysts employing response rate in research, see Drifke et al., 2020; Miller et al., 2021). Hmm, wait a minute—collecting, aggregating, analyzing, and interpreting aggregate data—that’s statistics! It sure is. And, the specific way that behavior analysts interpret the data they see is through the lens of the operant model of behavior—the three-term contingency.
Figure 1.3 Example of a line graph depicting responses per minute.
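To illustrate the aggregation step behind a graph like Fig. 1.3, here is a short sketch (ours; the data and condition names are hypothetical) that collapses raw session records from a functional analysis into the rates a behavior analyst would plot:

```python
# Hypothetical functional analysis records: (condition, count of target
# responses, session duration in minutes), one tuple per session.
sessions = [
    ("attention", 12, 10), ("demand", 3, 10), ("tangible", 5, 10),
    ("play", 0, 10), ("attention", 15, 10), ("demand", 2, 10),
    ("tangible", 6, 10), ("play", 1, 10),
]

# Aggregate each session's raw frequency into a rate (responses per minute),
# grouped by condition: the statistic plotted on a functional analysis graph.
rates_by_condition = {}
for condition, count, minutes in sessions:
    rates_by_condition.setdefault(condition, []).append(count / minutes)

# A second layer of aggregation: the mean rate per condition.
for condition, rates in rates_by_condition.items():
    mean_rate = sum(rates) / len(rates)
    print(f"{condition:>9}: sessions {rates} -> mean {mean_rate:.2f} resp/min")
```

Plotting the session-by-session rates reproduces the familiar functional analysis line graph; the statistics were there all along.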
Figure 1.4 Example of a line graph depicting the percentage of opportunities with some response.
Behavior analysts are also often interested in aggregating data at the session level for behavior that occurs in response to task presentation. To illustrate, behavior analysts may implement trial-based instruction with their clients to establish language skills. In doing so, the behavior analyst would collect the frequency of independent correct responses across trials, aggregate response-by-response data into a single number of the frequency of independent correct responses divided by the total number of trials, depict the percentage of trials with independent correct responses in a line graph, and visually analyze and interpret the aggregated data depicted on the graph. Fig. 1.4 provides an example of a graph depicting hypothetical client data from trialbased instruction (for two examples, among many, of behavior analysts employing percentage of opportunities with some response in research, see Clements et al., 2021; Halbur et al., 2021). Yep, again, statistics! Beyond evaluating responding across time, behavior analysts sometimes seek to understand consumer responding across environmental variables of interest. To illustrate, Ingvarsson and Le (2011) evaluated the relative influence of different prompt types to establish intraverbals for four participants with autism spectrum disorder. As expected, the authors collected the frequency of correct responses, calculated and depicted the total correct responses per session in a line graph, and visually analyzed and interpreted the aggregated data; their analysis didn’t end there. Interestingly, they also aggregated data in a second
way by calculating total training trials to criterion, depicted the total training trials to criterion in bar graphs, and visually analyzed and interpreted these aggregated data. Employing statistics in this way provided immediate benefit for the reader as it facilitated interpretation of outcomes within and across participants.

So far, the examples provided involved an interest in the occurrence of some behavior as a function of environmental variables. Yet, behavior analysts also routinely leverage statistics as a means to evaluate and communicate the degree to which our measurement of a behavior of interest is reliable (i.e., interobserver agreement [IOA]); the degree to which our behavior as interventionists occurred as intended (e.g., procedural integrity, treatment integrity); and the degree to which the behaviors of interest, the behavior change procedures employed, and the outcomes of behavior change efforts are acceptable (i.e., social validity). In each of these cases, it is common for behavior analysts to collect-aggregate-analyze-interpret data—again, statistics!

Using IOA as a specific example, there are many threats to the accuracy and reliability of data obtained through nonautomated data recording (ubiquitous in ABA). Behavior analysts commonly calculate IOA to serve as a proxy for measurement quality. Although a number of methods exist, IOA generally involves data collection by at least two independent observers, aggregating the data collected by each observer, analyzing by comparing the data obtained from the independent observations, and interpreting the degree to which the independent observers obtained similar data. The reporting of this process and its outcomes (typically expressed as arithmetic mean agreement) is a gold standard in behavior analytic research (Cooper et al., 2020, p. 117). By golly, statistics again!

By this point, we hope the provided examples carried a wave of familiar serenity into your life about "statistics." We also hope they helped to fine-tune your perspective on the relevance of statistics for behavior analysts. Yet, the examples provided only scratch the proverbial surface of what statistics has to offer us as behavior analysts. As our science evolves, so too does our need to understand and apply statistics. Historically, behavior analysts have resisted research methodologies that go hand in hand with statistical practices that involve aggregating behavioral data across participants (see Branch & Pennypacker, 2013). Yet, there appears to be a paradigm shift
occurring within ABA³ that has included calls for behavior analysts to employ between-subject experimental designs (e.g., Hanley, 2017), the emergence of a research area referred to as the applied quantitative analysis of behavior (Jarmolowicz et al., 2021), and the publication of studies employing such designs and analyses in behavior analytic outlets (e.g., Journal of Applied Behavior Analysis; e.g., Fisher et al., 2020; Greer & Shahan, 2019; see also the Perspectives on Behavior Science special section [Volume 44, Issue 4] on applications of quantitative methods). A new era that extends our current statistical practices is upon us.
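Since the IOA process described above boils down to a bit of counting and division, here is a minimal Python sketch of one common variant, interval-by-interval agreement. The observer records are hypothetical, and this is only one of the many IOA methods behavior analysts use:

```python
# Two independent observers record whether the target behavior
# occurred (1) or not (0) in each of 10 intervals. Hypothetical data.
observer_1 = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
observer_2 = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# Aggregate: count the intervals where the two records agree.
agreements = sum(a == b for a, b in zip(observer_1, observer_2))

# Analyze: express agreement as mean percentage agreement.
ioa = agreements / len(observer_1) * 100

print(f"IOA = {ioa:.1f}%")  # 80.0% for these hypothetical records
```

Interpreting that 80% against a conventional standard is the final step of the collect-aggregate-analyze-interpret sequence: statistics, once more.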
Chapter summary

Statistics has had a notorious reputation in behavior analysis. Part of the reason this may have carried to the present day is via inaccurate conflation of "statistics" with group design research and deductive reasoning, which also have (we believe unfounded) notorious reputations in behavior analysis. Whether the original aversion to all statistics was warranted, historically, is unknown to us as neither of us was around. But, modern statistics involves the collection, analysis, interpretation, and presentation of aggregate quantitative data. Behavior analysts collect, analyze, interpret, and present aggregate quantitative data already in their day-to-day lives. That is, behavior analysts already use statistics even if they do not call it that. However, just as there are "appropriate" ways to present and analyze data visually so as to avoid misinterpreting and misleading your audience, there are "appropriate" ways to do statistics. Throughout the remainder of the book, we hope to show you at least three things. First, statistics are not scary and you likely already think about your data in similar ways. Second, we hope to show the assumptions and "appropriate" ways of approaching your data when you use statistics to analyze, interpret, and present your data. Finally, we hope to show that behavior analysts who get comfortable with using statistics can do some neat things with their data that they would be unable to do without the use of statistics. We are excited you have joined us on this journey. And, we are even more excited to see behavior analysts use and talk about statistics in exciting ways.
³ The reasons for this shift seem a bit outside the scope of this chapter. But, as a timeline of utility, consider the ongoing evolution of the quantitative practices employed by our colleagues in the experimental analysis of behavior.
References

Baer, D. M., Wolf, M. M., & Risley, T. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1(1), 91–97. Available from https://doi.org/10.1901/jaba.1968.1-91.
Baum, W. M. (1974). On two types of deviation from the matching law: Bias and undermatching. Journal of the Experimental Analysis of Behavior, 22(1), 231–242. Available from https://doi.org/10.1901/jeab.1974.22-231.
Baum, W. M., & Rachlin, H. C. (1969). Choice as time allocation. Journal of the Experimental Analysis of Behavior, 12(6), 861–874. Available from https://doi.org/10.1901/jeab.1969.12-861.
Branch, M. (2014). Malignant side effects of null-hypothesis significance testing. Theory & Psychology, 24(2), 256–277. Available from https://doi.org/10.1177/0959354314525282.
Branch, M. N., & Pennypacker, H. S. (2013). Generality and generalization of research findings. In G. J. Madden, W. V. Dube, T. D. Hackenberg, G. P. Hanley, & K. A. Lattal (Eds.), APA handbook of behavior analysis, Vol. 1: Methods and principles (pp. 151–175). American Psychological Association.
Catania, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Authors Cooperative, Inc.
Clements, A., Fisher, W. W., & Keevy, M. (2021). Promoting the emergence of tacting three-digit numerals through a chain prompt combined with matrix training. Journal of Applied Behavior Analysis, 54(4), 1405–1419. Available from https://doi.org/10.1002/jaba.861.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2020). Applied behavior analysis (3rd ed.). Merrill-Prentice Hall.
Cox, D. J. (2019). The many functions of quantitative modeling. Computational Brain & Behavior, 2(3–4), 166–169. Available from https://doi.org/10.1007/s42113-019-00048-9.
Dallery, J., & Soto, P. L. (2013). Quantitative description of environment-behavior relations. In G. J. Madden (Ed.), APA handbook of behavior analysis: Vol. 1 (pp. 219–249). American Psychological Association.
Drifke, M., Tiger, J. H., & Lillie, M. A. (2020). DRA contingencies promote improved tolerance to delayed reinforcement during FCT compared to DRO and fixed-time schedules. Journal of Applied Behavior Analysis, 53(3), 1579–1592. Available from https://doi.org/10.1002/jaba.704.
Ennis, R. (1969). Logic in teaching. Prentice Hall.
Feynman, R. P. (1965). The Feynman lectures on physics: Volume 1. Chapter 3: The relation of physics to other sciences. Basic Books. Available from https://www.feynmanlectures.caltech.edu/I_03.html.
Fisher, W. W., Luczynski, K. C., Blowers, A. P., Vosters, M. E., Pisman, M. D., Craig, A. R., Hood, S. A., Machado, M. A., Lesser, A. D., & Piazza, C. C. (2020). A randomized clinical trial of a virtual-training program for teaching applied-behavior-analysis skills to parents of children with autism spectrum disorder. Journal of Applied Behavior Analysis, 53(4), 1856–1875. Available from https://doi.org/10.1002/jaba.778.
Greer, B. D., & Shahan, T. A. (2019). Resurgence as choice: Implications for promoting durable behavior change. Journal of Applied Behavior Analysis, 52(3), 816–846. Available from https://doi.org/10.1002/jaba.573.
Halbur, M., Kodak, T., Williams, X., Reidy, J., & Halbur, C. (2021). Comparison of sounds and words as sample stimuli for discrimination training. Journal of Applied Behavior Analysis, 54(3), 1126–1138. Available from https://doi.org/10.1002/jaba.830.
Hanley, G. P. (2017). Editor's note. Journal of Applied Behavior Analysis, 50(1), 3–7. Available from https://doi.org/10.1002/jaba.366.
Ingvarsson, E. T., & Le, D. D. (2011). Further evaluation of prompting tactics for establishing intraverbal responding in children with autism. The Analysis of Verbal Behavior, 27(1), 75–93. Available from https://doi.org/10.1007/BF03393093.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. Available from https://doi.org/10.1371/journal.pmed.0020124.
Jarmolowicz, D. P., Greer, B. D., Killeen, P. R., & Huskinson, S. L. (2021). Applied quantitative analysis of behavior: What it is, and why we care—Introduction to the special section. Perspectives on Behavior Science, 44(4), 503–516. Available from https://doi.org/10.1007/s40614-021-00323-w.
Johnston, J. M., Pennypacker, H. S., & Green, G. (2020). Strategies and tactics of behavioral research and practice (4th ed.). Routledge.
Kirby, K. C., & Bickel, W. K. (1988). Toward an explicit analysis of generalization: A stimulus control interpretation. The Behavior Analyst, 11, 115–129. Available from https://doi.org/10.1007/BF03392465.
LeBlanc, L. A., Raetz, P. B., Sellers, T. P., & Carr, J. E. (2016). A proposed model for selecting measurement procedures for the assessment and treatment of problem behavior. Behavior Analysis in Practice, 9(1), 77–93. Available from https://doi.org/10.1007/s40617-015-0063-2.
Marr, M. J. (2015). Reprint of "Mathematics as verbal behavior." Behavioural Processes, 114, 34–40. Available from https://doi.org/10.1016/j.beproc.2015.03.008.
Mazur, J. E. (1986). Choice between single and multiple delayed reinforcers. Journal of the Experimental Analysis of Behavior, 46(1), 67–77. Available from https://doi.org/10.1901/jeab.1986.46-67.
McDowell, J. J. (2005). On the classic and modern theories of matching. Journal of the Experimental Analysis of Behavior, 84(1), 111–127. Available from https://doi.org/10.1901/jeab.2005.59-04.
Merriam-Webster. (2021). Statistics. Retrieved from https://www.merriam-webster.com/dictionary/statistics
Merriam-Webster. (2022). Models. Retrieved from https://www.merriam-webster.com/dictionary/model
Michael, J. (1982). Distinguishing between discriminative and motivational functions of stimuli. Journal of the Experimental Analysis of Behavior, 37(1), 149–155. Available from https://doi.org/10.1901/jeab.1982.37-149.
Miller, S. A., Fisher, W. W., Greer, B. D., Saini, V., & Keevy, M. D. (2021). Procedures for determining and then modifying the extinction component of multiple schedules for destructive behavior. Journal of Applied Behavior Analysis. Advance online publication. Available from https://doi.org/10.1002/jaba.896.
Mish, F. (1991). Webster's ninth new collegiate dictionary. Merriam-Webster Inc.
Moxley, R. A. (1996). The import of Skinner's three-term contingency. Behavior and Philosophy, 24(2), 145–167. Available from https://www.jstor.org/stable/27759351.
Palmer, D. C. (2009). Response strength and the concept of the repertoire. European Journal of Behavior Analysis, 10(1), 49–60.
Pisman, M. D., & Luczynski, K. C. (2020). Caregivers can implement play-based instruction without disrupting child preference. Journal of Applied Behavior Analysis, 53(3), 1702–1725. Available from https://doi.org/10.1002/jaba.705.
Rachlin, H. (2006). Notes on discounting. Journal of the Experimental Analysis of Behavior, 85(3), 425–435. Available from https://doi.org/10.1901/jeab.2006.85-05.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Basic Books.
Simon, M. A. (1996). Beyond inductive and deductive reasoning: The search for a sense of knowing. Educational Studies in Mathematics, 30, 197–210. Available from https://doi.org/10.1007/BF00302630.
Simpao, A. F., Ahumada, L. M., & Rehman, M. A. (2015). Big data and visual analytics in anaesthesia and health care. British Journal of Anaesthesia, 115(3), 350–356. Available from https://doi.org/10.1093/bja/aeu552.
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. Appleton-Century-Crofts.
Skinner, B. F. (1953). Science and human behavior. Macmillan.
Skinner, B. F. (1956). A case history in scientific method. American Psychologist, 11(5), 221–233. Available from https://doi.org/10.1037/h0047662.
Skinner, B. F. (1957). Verbal behavior. Appleton-Century-Crofts.
Skinner, B. F. (1963). Behaviorism at fifty. Science, 140(3570), 951–958. Available from http://www.jstor.org/stable/1711326?origin=JSTOR-pdf.
Skinner, B. F. (1974). About behaviorism. Alfred A. Knopf.
Steed, C. A., Ricciuto, D. M., Shipman, G., Smith, B., Thornton, P. E., Wang, D., Shi, X., & Williams, D. N. (2013). Big data visual analytics for exploratory earth system simulation analysis. Computers & Geosciences, 61, 71–82. Available from https://doi.org/10.1016/j.cageo.2013.07.025.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. The American Statistician, 70(2), 129–133. Available from https://doi.org/10.1080/00031305.2016.1154108.
Young, M. E. (2018). A place for statistics in behavior analysis. Behavior Analysis: Research and Practice, 18(2), 193–202. Available from https://doi.org/10.1037/bar0000099.
CHAPTER 2

The requisite boring stuff part II: Data types and data distributions

"If you live to be one hundred, you've got it made. Very few people die past that age."
George Burns
Introduction

If you are reading this sentence, then a big round of applause to you! You made it through the first chapter without being sufficiently scared off by all the talk of statistics, numbers, and quantitative theory. We can promise you that it only gets more fun from here. If you are coming straight from the end of Chapter 1, you can likely jump ahead to the section titled, "What haven't we considered so far?" However, in case you are one of those readers who like to set down a book for a little bit after finishing each chapter, and some time has elapsed since you finished Chapter 1, here is a quick recap of what we discussed and how it fits with where we are headed.

To successfully navigate any landscape often requires that you first identify where you are. This is a book about statistics. So, in Chapter 1, we provided an overview of the landscape this book sits within and the tools at our disposal. To begin, we defined statistics as a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data (Merriam-Webster, 2021). The "branch of mathematics" bit just means we are using numbers to describe our data, and we analyze the data we collect based on the logic and rules of mathematics. In many instances, this just involves addition (e.g., adding up the number of times a behavior occurred) and division (e.g., dividing the number of times a behavior occurred by the length of a session to get rate of responding).
Outside that "branch of mathematics" phrase, the rest is pretty innocuous. Behavior analysts allocate a good portion of their professional time to collecting, analyzing, interpreting, and presenting aggregated quantitative data. Common statistics in this realm include rate of responding, percentage of total opportunities correct, calculations of interobserver agreement (IOA), and average latency to respond. Even the simple data displays in cumulative records technically aggregate (sum up) the total number of responses to the point in time represented on the x-axis to allow for response-by-response analysis. In short, statistics are everywhere and you are likely already using them.

So what's the historical rub about statistics in behavior analysis? Well, neither Jason nor David knows exactly what happened as we both were born just a smidgeon after the onset of the field and after much of the generalized aversion seemed to have taken root. However, we suspect that overgeneralization of a valid and true concern regarding one branch of statistics was the culprit (Sidman, 1960). Combine this valid concern with the unique visual analysis approach used by early behavior analysts and the likely criticism they received for using visual analysis instead of quantitative analyses and—voila—an "us" vs. "them" false dichotomy was also likely born. And, we all know what happens when Yankee and Red Sox fans get together and what occurred in the Butter Battle Book (Seuss, 1984). The rest of us know it's still the same great game of baseball and buttered toast is delicious regardless of whether you butter the top or bottom. But, social contingencies are what they are and here we sit today.

Briefly, Skinner (1938) argued that scientists should experimentally isolate and study the variables that control differences in responding between individuals. This contrasted with the hot and burgeoning field of null hypothesis significance testing (NHST) that was making waves in psychological fields of the early 1900s. With NHST, the idea is to use a specific set of mathematical techniques (we'll get to these later in the book) to quantify the probability that you got the results you did assuming that the independent variable (IV) actually has no influence on the dependent variable (DV). If you read that last sentence closely, you'll realize NHST doesn't tell you anything about the likelihood that your IV does influence your DV, which is really what we want to know (for a full discussion of the associated faulty logic, see Branch, 2014). Ironically, statisticians have been saying the same thing for years (Wasserstein & Lazar, 2016). But, nevertheless, NHST grew in
popularity, was often used in group design research, and increasingly was associated with the "other" scientists who did not use visual analysis. But, as noted earlier and in Chapter 1, statistics and NHST are not synonymous. Rather, NHST represents only one of many methods available for those pursuing statistical inferences. And conflating statistics with group design research is just that: a conflation. In defining statistics as the branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data, we can see that behavior analysts use statistics regularly. To close out the last chapter, we provided some examples of everyday practices for many behavior analysts wherein they use statistics on the regular. With that, you're fairly caught up on what we have covered so far.
What haven't we considered so far?

To do statistics brings in all those fun verbs from our definition. We collect data, we aggregate the data in some way, we analyze those data, we interpret what all that verbal behavior means, and we present what we have found out to the world. This chapter is about the first two of those verbs: collecting and aggregating data. In short, how we collect data depends on the type of data that we are collecting. And, how we aggregate the data for eventual analysis depends on the type of distribution the data would likely follow if we collected a bunch of it. Trying to use the wrong data collection method for the type of data you are collecting, or wrongly aggregating your data based on the distribution it follows, is like trying to sell a "hot dog" made of tofu, shaped into square blocks, and served atop a bowl of mixed greens and quinoa at a baseball stadium. As delicious as this dish is likely to be, it's not a hot dog. So, claiming and selling it as a hot dog is...not quite right.

To show why the topics of this chapter are important, it may help to draw a parallel with some of the decisions a behavior analyst has to make when they want to conduct a functional analysis. Consider a situation where a client emits aggressive behavior and a behavior analyst has been asked to accurately identify the function of the behavior so that an appropriately matched and effective intervention can be designed. What if the measure of aggression used by the behavior analyst was the number of times the client wore white vs. black socks, and they reported that the function of the client's aggressive behavior
occurs at 1.5 white socks per day on average. We suspect many of you quickly realized how absurd it is to measure what color socks someone is wearing as a measure of aggression. This is akin to choosing the wrong data type. Claiming someone wears 1.5 white socks per day is also absurd as socks only come in whole numbers. This is akin to using the wrong data distribution for your data. Analyzing your data without knowing the data type and distribution of your data is kind of like not knowing whether you're measuring sock color or face-directed aggressions during your functional analysis. When we use the wrong data type or the wrong distribution, our analyses can quickly devolve into being illogical.

Fortunately, few behavior analysts make errors this egregious. This is likely because enough people have published examples of the "correct" way to handle data of different types and from different distributions such that imitating others often serves us well enough. But, if all of this talk about data types and distributions is a bit new to you, then this chapter has the stuff you need. After reading it, we hope you won't have to worry about differently colored socks working their way into your functional analysis.
Data types

A quick caveat is in order before we dive into the four data types commonly discussed in statistics books the world over (Table 2.1). As with many things in life, the boundaries created around things are often human made and "nature does not know it!" (Feynman, 1965). Thus, though we talk about each of these data types as having hard and fast boundaries, it is not always the case that data collected in one way cannot be transformed or modified into another data type. As we discuss in more detail later, each data type has properties beneficial for some kinds of analytics and that pose limitations for other kinds of analytics. The type we end up going with often depends on the question we have or the data type in which our data happen to have been collected.

Table 2.1 Common data types behavior analysts encounter in practice.

Discrete, nominal
  Description: Used to label variables without providing any quantitative value.
  General examples: Culture/ethnicity; nationality; gender; state of residence; diagnoses; favorite pizza toppings.
  Examples in ABA practice: Condition (e.g., baseline vs. treatment); setting (e.g., home vs. school); therapist; skill domain (e.g., verbal behavior, activities of daily living).

Discrete, ordinal
  Description: Used for data that can be ranked on a scale, but without a degree or size difference between values.
  General examples: Education level; income (e.g., low, middle, upper); job titles (e.g., junior BCBA, senior BCBA); some Likert scales.
  Examples in ABA practice: Preference assessment rankings; focused vs. comprehensive treatment; acceptability rating scales of goals, interventions, and outcomes.

Discrete, quantitative
  Description: Used for data that can be ranked on a scale and with a degree or size difference between values.
  General examples: Students in class; patients admitted to hospital; deaths from gun violence; graduates with a job within six months of graduating; steps walked; some Likert scales.
  Examples in ABA practice: Session number; trials with correct responding; one-hot encoded setting; responses emitted in session; clients on caseload; graduates who pass the BACB exam; units billed to insurance.

Continuous, quantitative
  Description: Used for data that can take any real number value between negative and positive infinity.
  General examples: Temperature; height; weight; time of day; amount of rainfall; calories consumed; speed of car.
  Examples in ABA practice: Average responses per minute; latency to respond; duration of responding.

ABA, Applied behavior analysis; BACB, Behavior Analyst Certification Board; BCBA, Board Certified Behavior Analyst.

Discrete data

The first distinction that we can make is whether our data are discrete or continuous. Discrete data are data with a finite or countably infinite number of possible outcomes. For example, we can count the number of dogs that someone has, the number of championships the
Pittsburgh Pirates have won, the number of mands Jamal emits during a session, the number of trials presented, or the number of times a rat presses a lever in an operant chamber. These are all discrete because the count of each of these examples can only take on a whole number value—a discrete value—often referred to as an integer. Three types of discrete data are commonly found in the behavior analytic literature.
Nominal data

The first type of discrete data behavior analysts commonly collect is nominal data. Sometimes referred to as categorical data, nominal data are collected when behavior analysts assign a label to data where the label does not provide any quantitative information. Though analyses with nominal data are often qualitative or require combining nominal data with a numerical data type, we will see later that this isn't always
the case. It just sometimes takes a little bit of work to get the data to the point where we can use them numerically.

Nominal data are commonly used by social and behavioral science researchers. One of the most common areas where nominal data are used is when researchers describe the demographics of research participants. Here, researchers will often list the number of participants who self-identify with labels spanning: culture or ethnicity; where in the world they reside (nation; state within a nation); gender; diagnoses; and so on. The goal here is typically twofold: first, to help the reader understand the degree of similarity between the research participants and anyone to whom the reader may hope to generalize the research findings; and, second, for randomized controlled trials, to demonstrate the (hopefully lack of) differences between participants in the treatment and control groups¹. In each of these categories, the researcher is unable to rank, order, or otherwise transform raw labels into numbers.

¹ Of note, authors publishing in the applied behavior analysis scientific literature increasingly call for researchers to publish participant demographics so that we can get a better understanding of for whom, and under what conditions, the effectiveness of an intervention might be impacted (e.g., Brodhead et al., 2014; Jones et al., 2020; Li et al., 2017).

Behavior analysts also publish and talk about a lot of intervention-related nominal data. Perhaps the most common nominal data collected are whether the behavioral data collected came from a "Baseline" or an "Intervention" condition. We might collect data on the location where an intervention session took place such as "Clinic," "Home," or "School." We might label the specific skill we're working on as falling under the category of "Verbal Behavior," "Activities of Daily Living," or "Leisure Activities and Play." Or, maybe we record the name of the specific therapist who conducted an intervention session. Similar to demographic data, these categorical data—by themselves—are often most useful when we combine them with other data such as how rates of responding (an additional data type) differ across condition, setting, therapist, or skill domain.

Sometimes we can turn nominal data into numbers. Perhaps the most common way to do this is sometimes referred to as creating "dummy variables" or "one-hot encoding." Here, the idea is to turn our categorical labels into a 1 or a 0 denoting the presence (1) or absence (0) of a specific variable for a given observation. For example, rather than having a column in our dataset with the words "Baseline"
or "Intervention" to describe the conditions under which that observation was made, we can talk about the presence or absence of an intervention in effect and convert "Baseline" to a 0 and "Intervention" to a 1 (left side of Table 2.2).

Table 2.2 Example of how behavior analysts might convert nominal data to numerical.

Condition      Encoded condition | Setting   Encoded clinic   Encoded home   Encoded school
Baseline       0                 | Clinic    1                0              0
Baseline       0                 | Clinic    1                0              0
Baseline       0                 | Home      0                1              0
Intervention   1                 | School    0                0              1
Intervention   1                 | Home      0                1              0
Intervention   1                 | Clinic    1                0              0

We can also turn nominal data into numbers when there are more than two labels present for a specific variable in our dataset (right side of Table 2.2). As noted above, nominal data—by definition—means there is no inherent way to rank or order the data. So we can't convert nominal data with more than two labels into a single column with three or more numerical values because the result would be nonsensical. For example, consider what happens if we convert "Clinic," "Home," and "School" to 0, 1, and 2, respectively. It would be nonsensical to talk about "School" as being twice as much as "Home." But, we still can talk about each session as being in the clinic ("yes" or "no"), at home ("yes" or "no"), or at school ("yes" or "no"). Thus we can expand our single column with the nominal labels for setting into three columns where each column denotes "yes" (1) the intervention was in that setting or "no" (0) it was not. We'll talk later in this chapter and in the book about why turning nominal data into numbers is useful when conducting different types of analyses. For now, however, the main point is that nominal data are everywhere in behavior analysis and, if need be, we can convert nominal data to numbers to take advantage of the wonderful properties that come with our data being numbers instead of text (Chapter 1).
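For readers who like to see the encoding spelled out, here is a minimal Python sketch of the conversions in Table 2.2 using pandas. The column names are ours for illustration, not a standard:

```python
import pandas as pd

# Hypothetical session-level records with two nominal variables.
df = pd.DataFrame({
    "condition": ["Baseline", "Baseline", "Baseline",
                  "Intervention", "Intervention", "Intervention"],
    "setting": ["Clinic", "Clinic", "Home", "School", "Home", "Clinic"],
})

# Two-label variables can become a single 0/1 column (left side of Table 2.2).
df["encoded_condition"] = (df["condition"] == "Intervention").astype(int)

# Variables with three or more labels get one 0/1 column per label
# (right side of Table 2.2), i.e., one-hot encoding / dummy variables.
df = pd.get_dummies(df, columns=["setting"], prefix="encoded", dtype=int)

print(df)
```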
Ordinal data

A second discrete data type is ordinal data. Ordinal data refer to data that have two properties (Table 2.1). First, the data can easily be ranked
on a scale, meaning that one of the labels or values that a data measurement could take is naturally greater than (or lesser than) another label or value that the data measurement could take. Second, there is no known degree or size difference between the different labels or values that a particular measurement could take. Stated differently, the data have a natural order (which nominal data do not) but you can't really say ordinal data with value "2" is twice as much as ordinal data with value "1." Some examples will likely help.

Returning to our demographic examples, many demographic data are of the ordinal data type. For example, education level is a great example of ordinal data. When reporting on education level, people typically will select one out of many options to indicate the highest level of education they have received such as High School, Bachelor's, Master's, or Doctorate. Those familiar with the United States educational system know that someone with a Bachelor's degree has more education than someone with a High School degree. And, someone with a Doctorate degree has more education than someone with a Master's degree. But, because some people may finish any of these degrees in 2 years or 10 years, the label—by itself—does not tell you anything about the degree or size of difference in years of education. We only know that one is more or less than the others because the degrees have to be completed in a specific order: ordinal data.

Many other types of commonly collected data in the social and behavioral sciences are ordinal. For example, many business settings provide titles for different roles to indicate years of experience in that area (and that often coincide with pay scales) such as "Junior," "Senior," or "Principal" roles. Another example might be years of experience in some specific area as asked on a survey where the response options are the boxes: 0–5 years, 5–10 years, and 10+ years. Or, as a final example, anyone who has completed survey work has likely responded to questions with Likert scale responses such as "Strongly Disagree—Disagree—Agree—Strongly Agree"; "Strongly Dislike—Dislike—Like—Strongly Like"; or "Would Never Buy—Probably Wouldn't Buy—Probably Would Buy—Would Definitely Buy." In each instance, the values can easily be ordered along some kind of scale but the degree or size difference between the levels might not be known.

Behavior analysts also collect or use a lot of ordinal data in their daily research and practice. For example, the results of a Multiple
Stimulus without Replacement (MSWO) preference assessment provide a ranked order of which items are more or less preferred than others. When we submit funding requests to insurance companies, we often specify whether the individual needs "comprehensive" or "focused" treatment. And, behavior analysts often evaluate the degree to which consumers view goals, interventions, and outcomes as acceptable using scales that range from "Strongly Disagree" to "Strongly Agree." As above, each of these examples includes data collected or being used wherein the order of the data denotes whether one value is more or less than the others, but the size of the difference is unknown from the labels alone.

As with nominal data, ordinal data start to become analytically useful to the extent that we can turn the labels into numbers. Sometimes, this comes with the collected data itself. For example, preference assessment rankings are typically ordered with "1" being the most preferred item, "2" being the second most preferred item, and so on. In other situations, similar to nominal data, we may need to turn the textual labels into numbers (i.e., encode our data)². For example, we might assign "1" to "High School Diploma," "2" to "Bachelor's Degree," and so on. And, many Likert scales can be handled in a similar format such as assigning "1" to "Strongly Disagree," "2" to "Disagree," and so on.

² Of note, many survey platforms such as Qualtrics and SurveyMonkey allow the researcher or practitioner to embed these encodings into the platform so the data can be downloaded in a numerical format to facilitate analysis.

The examples in the last paragraph start to highlight a few points of caution that behavior analysts should be aware of when transforming ordinal data into numbers for analyses. First, the preference assessment rankings were encoded such that "lower numbers mean more" as items ranked with lower numbers are "more preferred" by the consumer. In contrast, the education and agreement scales were encoded such that "higher numbers mean more" as higher numbers mean "more education" or "greater agreement" with the statement. When conducting analytics using multiple ordinal data types, you'll want to keep track of the directions of "better" or "more" as it'll impact how you interpret your results. Second, though we can technically keep ordinal data as a single column in our spreadsheet for analysis, it may sometimes make sense to turn the ordinal data into categorical.
For example, suppose we want to compare rates of responding when the most preferred item is used as a putative reinforcer compared to when the third most preferred item is used as a putative reinforcer. When we want to treat ordinal data as categorical, we can always one-hot encode the data, if needed. Later in this book, we'll talk more about how this requires good bookkeeping on the part of the behavior analyst and different strategies to help reduce the likelihood we misinterpret the results of our analyses. For now, however, the main point is that ordinal data are everywhere in behavior analysis; they often come in numerical format or can easily be converted to a numerical format; and, just because they are numerical does not mean we want to analyze them while retaining their quantitative properties.
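Encoding ordinal labels is just as quick; a minimal Python sketch, with labels and values that are illustrative rather than prescriptive:

```python
import pandas as pd

# Hypothetical Likert-scale responses.
responses = pd.Series(["Agree", "Strongly Disagree", "Disagree",
                       "Agree", "Strongly Agree"])

# Explicit mapping chosen so "higher numbers mean more" (greater agreement).
likert_order = {"Strongly Disagree": 1, "Disagree": 2,
                "Agree": 3, "Strongly Agree": 4}
encoded = responses.map(likert_order)
print(encoded.tolist())  # [3, 1, 2, 3, 4]

# Note: preference assessment ranks run the other way ("lower numbers
# mean more"; 1 = most preferred), so track the direction per variable.
```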
Quantitative discrete data

The last discrete data type is quantitative. Quantitative discrete data are data where different measurements can also be ranked on a scale, just like ordinal data. The difference between quantitative discrete data and ordinal data is that we also have information about the degree or size of the difference between the measurements collected. Data where we know both the rank and the size of difference between different measurements are the most flexible kind of discrete data we can have when it comes to conducting analytics.

Social and behavioral researchers and practitioners the world over make heavy use of quantitative discrete data. Examples here are legion and include the number of students in a classroom; the number of patients admitted to the emergency room (ER) each week; the number of deaths from gun violence in the United States; the number of ice cream cones sold during the summer in Chicago; the number of university graduates who successfully get a job within six months after they graduate; or the number of black or white socks Halima wears. In each of the abovementioned instances, we can only talk about discrete, whole number measurements. It would be nonsensical to say 22.5 students are in a classroom, 12.84 patients were admitted to the ER, 3045.23 people died from gun violence, 786.21 ice cream cones were sold, 954.37 people got jobs, or that someone wears 1.56 white socks. This would be illogical because people, ice cream cones, and socks don't come in partial amounts.
Behavior analysts also make heavy use of quantitative discrete data. For example, the x-axis of many time series graphs lists the session number (or date) during which data were collected. For our DVs, we often count the number of responses a client emitted during a session, the total number of discrete trials during which a student emitted the correct response, or the number of trials/intervals where two independent observers recorded the same data. The percentage of trials where the client emits the correct response is another common discrete quantitative data type used by behavior analysts. Though the percentage of correct trials is often presented with decimals, the values it can take fall into discrete buckets. For example, if there are 10 total trials, the percentage of correct responses can only take the values of 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 100%. Similarly, if there are seven trials, the percentage of correct responses can only take the values of 0%, 14.29%, 28.57%, 42.86%, 57.14%, 71.43%, 85.71%, or 100%. The same holds true for IOA calculations. The important point here is that the type of data we have depends on how it was collected and the possible values it can take, not necessarily the verbal stimuli we use to describe it.

Behavior analysts also regularly use discrete quantitative data to analyze important characteristics at the organizational level. Here, they might collect and analyze data on the number of clients on each Board Certified Behavior Analyst's (BCBA's) caseload, the number of Registered Behavior Technicians (RBTs) a BCBA supervises, the number of BCBAs in a region, or the number of staff who turned over in the last quarter. Lastly, paralleling the percentage of trials with correct responses from above, a popularly discussed discrete quantitative data type at the field level in applied behavior analysis (ABA) is the number of students who pass the BACB exam upon graduation from different programs. Again, in each of these instances, the things being counted are sessions, dates, responses, clients, trials, supervisees, BCBAs, or prospective BCBAs. None of these can come in partial amounts, only discrete units.

Quantitative discrete data are very useful analytically because of the information they contain. Often behavior analysts will combine quantitative discrete data with nominal or ordinal data when they conduct analyses because it allows us to start looking at the relative degree or size of difference between different nominal or ordinal conditions (Fig. 2.1). For example, we might plot the number of times a client
mands during each session to analyze whether the number of mands is increasing with time. Such a plot likely involves combining the discrete quantitative data of "number of responses" on the y-axis with the session number during which the data were collected on the x-axis; and, we might even add in the nominal discrete data of whether the data were collected during a "Baseline" or "Intervention" period.

Figure 2.1 Classic time series plot demonstrating an intervention effect of functional communication training (FCT) on the number of correct mands emitted during baseline and treatment sessions. Note that each major type of discrete data is included in the plot.
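And to make the earlier point about discrete percentage "buckets" concrete, here is a short Python sketch that enumerates every value the percentage of correct trials can take for a fixed number of trials:

```python
# Percentages derived from a fixed trial count are discrete: only
# n_trials + 1 distinct values are possible.
def possible_percentages(n_trials):
    return [round(100 * correct / n_trials, 2) for correct in range(n_trials + 1)]

print(possible_percentages(10))  # [0.0, 10.0, 20.0, ..., 100.0]
print(possible_percentages(7))   # [0.0, 14.29, 28.57, 42.86, 57.14, 71.43, 85.71, 100.0]
```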
Continuous data

Though contrary examples may exist somewhere in the universe, continuous data are typically quantitative. Continuous data are data that can take any real number value between negative infinity (or sometimes zero) and positive infinity. Similar to discrete ordinal and discrete quantitative data, continuous data provide the user with a ranking; and similar to discrete quantitative data, they provide information about the degree or size difference between the measurements collected. However, because continuous data can take any real number value between negative and positive infinity, they provide the most flexibility and utility for analytics. Because of this analytic flexibility, behavior analysts often try to collect data on a continuous scale as it's easy to transform continuous data to discrete data; but it's often very difficult or impossible to transform discrete data to continuous.
Humans make heavy use of continuous data in our daily lives. For example, many of us may check the temperature outside before we get dressed; snowboarders check the amount of fresh pow that fell on the slopes before they choose whether to catch the lift; each of us has a specific height and weight; we structure our day around time; people on a diet may track the calories they consume each day; and we attend to the speed at which we drive our car (especially when police cars are around us). In each of these examples, the value the measure could take can go out to an infinite number of decimal places if we use a sensitive enough measurement tool.

Behavior analysts also try to make use of continuous data as often as their tools allow. Common examples are the latency to respond or the duration for which a set of operationally defined behaviors continues (e.g., a tantrum). Each of these data are collected in units of time and, thus, can take any real value greater than zero, which makes them continuous. Perhaps the most common example of continuous data collected by behavior analysts is the responses emitted per minute during an observation period. In this situation, even though responses are a discrete data type, the amount of time we can observe the individual for is continuous. Thus it is possible to obtain a derived "responses per minute" value that takes any real number value greater than or equal to zero, making it continuous (see Chapter 3 for additional details about derived point estimates common in behavior analysis).

Continuous data are the most useful analytically. This is because when we analyze data, we are often most interested in differences or variations in our measurement as a function of something else such as intervention condition, setting, the therapist working with our client, or behavior analysts working in different clinics or different regions. As might be intuitive, the more values our data can take, the more opportunities there are for differences—even if subtle—to be captured in our collected data. And, it is those differences that tell us something we can act upon. Whether a measure is truly continuous or is actually discrete but presented as continuous depends on the values the measure can take. And, as we'll see next, sometimes the values of our discrete quantitative data are great enough that our data begin to look and feel as though they were continuous, such that we can take advantage of the wonderful properties that come with continuous data.
Data type summary

It's no secret that behavior analysts love data. We use data for just about everything we do and we continuously create systems to collect data, systems to visually display data, and systems to turn those visual displays into action through intervention design and modification. The data we collect can come in many different types, each with benefits and drawbacks. The two most common broad categories of data types are discrete and continuous. Discrete data are data that can only be collected in whole units. Those whole units might be category labels where there is no inherent ranking or order (nominal data); we might be able to rank or order the data though the degree of difference between values is unknown (ordinal data); or we might be able to rank or order the data, the degree of difference between values is meaningful, but our data still can only be collected in whole units (quantitative discrete data). Continuous data are the second broad category and are data where the values the data can take are any real number between negative infinity (or zero) and positive infinity.

Learning to identify the data types you work with is useful for several reasons. First, the analyses that behavior analysts can conduct are constrained by the type of data we have collected. This is primarily the result of the distribution the data come from (we cover this throughout the rest of the book). The distribution from which our data come constrains how we can describe and talk about the data we collect (Chapters 3 and 4); how and what we can say when we relate or compare two different data sources (Chapters 5–7); and, how we can model the relationships among many data types to describe, predict, and control behavior (Chapters 5–8).
Data distributions

Probability distributions³

³ To keep the scary bits out of the way for a gentler reading, all the equations for the different probability distributions are located at the end of the chapter. No worries if you're not interested in the equations. Excel does the heavy lifting for you with each of these, and the Excel doc available here: https://github.com/david-j-cox/SupplMat-Stats-ABA/ shows you what that looks like. We strongly encourage you to check it out as becoming familiar with the equations and how to use probability distributions will likely be a behavioral cusp for many readers.

Before we jump headlong into different data distributions and the questions they answer, it'll likely help to talk about what a probability distribution is. This involves two main concepts: probability and distributions. Probability is likely to be intuitive to most behavior analysts. With a few exceptions, operant and respondent behavior are rarely
elicited or emitted every time the eliciting or evoking stimulus is presented. The reasons for this are outside the scope of this book; however, a familiar example might help to highlight what we are talking about. Consider a situation where we present a client with 10 learning trials. On each trial, they can either emit the correct response or not. Unless you have a client who is a superhuman, it is unlikely the client is going to emit the correct response on all 10 trials, every day, in all contexts, for the rest of their life, even if they have already "mastered" the relevant target behaviors. If they are just learning the skill, then the number of correct trials might be zero or one. If they are somewhere in the acquisition phase, the number of correct trials might be something like three, five, or eight. And, if they have "mastered" the skill, the number of trials with correct responses might be nine or 10. In short, there is always a bit of uncertainty (or variability) around exactly how anyone will perform any skill on any given day. This lack of a perfect 1:1 relation between the environment and behavior makes the emission of behavior probabilistic. Behavior analysts have long discussed response probability (and its functional determinants) given its fundamental role in the science of behavior (Skinner, 1947, 1950).

Behavior analysts often talk in terms of many aggregated trials (statistics!) such as the percentage of all trials with correct responding or the average number of responses emitted per minute. But, often in life we don't get 10, 20, or 100 opportunities to do something. We get one chance to respond to our environment and get it right. Otherwise, life moves on. In these instances, we might ask, "What is the probability (or how likely is it) that the client gets any one trial correct?" An easy way to provide a calculated estimate of this probability is to simply take the number of total trials the client typically responds correctly and divide it by the total number of typical opportunities. So, if they consistently get four out of 10 trials correct during sessions, then our answer would be, "The probability the client will answer correctly on any single trial is 40%."⁴ Probability made easy, right?

⁴ This is technically the "frequentist" interpretation of a probability. In Chapter 11, we'll contrast the frequentist approach with the Bayesian approach. Because many behavior analysts are already familiar with presenting and collecting data on discrete trials, we suspect the frequentist approach is likely the most intuitive and easy to understand.

As already highlighted in the previous paragraph, few responses that any biological organism emits are the exact same every single
time. From day to day, context to context, person to person, responding has some degree of variability to it. This is where distributions become useful. Let's get at their utility with a slight spin on the question from the previous paragraph. Let's say we have a savvy teacher who knows that responding is likely to vary from learning opportunity to learning opportunity. But, they want to know: if they were to present 10 learning trials to our client tomorrow, what is the probability that the client will get seven of those correct? Or four of those correct? Or all 10 correct? As you may have guessed, a simple count of the total number of correct responses divided by the total number of opportunities isn't likely to help us here because responding naturally will vary from day to day. Probability distributions are one way to describe how the probability of responding might vary.

Probability distributions plot the probability (y-axis) that a specific value (x-axis) can take based on the data we have collected up to that point in time. We will see many examples of what this looks like below. But, the point here is that probability distributions allow us to describe the probability that a particular amount, degree, type, or level of responding will be observed. And, in the same way that data can come in many different types, probability distributions can also come in many different types depending on—probably no surprise here—the type of data we are talking about and the question we might be asking. We'll explain all this in a bit more detail using a first example situation where our data can take one of two values (i.e., binomial distributions). After that, we'll show how this generalizes to other types of discrete and continuous distributions.
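One way to feel the jump from a single probability estimate to a whole distribution is to simulate it. A minimal Python sketch, assuming a client who answers any single trial correctly with probability 0.4 (the 40% estimate from above):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate 1,000 hypothetical 10-trial sessions where each trial is
# answered correctly with probability 0.4.
correct_per_session = rng.binomial(n=10, p=0.4, size=1000)

# Tally how often each count occurred; dividing by the number of
# sessions approximates the probability distribution of the counts.
values, counts = np.unique(correct_per_session, return_counts=True)
for v, c in zip(values, counts):
    print(f"{v:2d} correct: p ~ {c / 1000:.3f}")
```

Even though every simulated session used the same underlying probability, the counts spread out into a distribution, which is exactly the variability the savvy teacher's question is about.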
Discrete data distributions

Binomial distribution

Perhaps one of the simplest distributions we can talk about is the binomial distribution. As the name implies, this distribution is used when we have discrete, nominal data (-nomial) which can take one of two (bi-) values: binomial. As highlighted earlier, data that might be binomial could be whether responding occurred (yes or no) during an interval, whether a target behavior occurred (via one-hot encoding; 1 = occurrence, 0 = no occurrence) in the home or clinic, and whether presenting learning opportunities leads to aggression (yes or no). In each example, we have some total number of observations (n) and an
observed frequency or probability (p) that the data collected was a "yes" or "occurrence." Fig. 2.2 shows how different sets of binomial data can be translated into the binomial probability distribution.

Probability distributions are useful because they give us a sense of the probability that we would observe the "target response" based on the number of total observations. Using the top row of Fig. 2.2 as an example, if we know the RBT will present 10 learning opportunities and, historically, 70% of those have been followed by aggression (top left panel), then we can estimate the probability that we will see aggression (top middle panel) on exactly 5 of the 10 trials as p = 10% (point A), the probability that we will see aggression on exactly 7 of the 10 trials as p = 27% (point B), or the probability that we will see aggression on
all 10 trials as p = 2.8% (point C). Such a distribution might be useful when communicating the likely range of trials during which an RBT, paraprofessional, or teacher might expect a target behavior to occur. For readers more comfortable with cumulative records, the top right panel plots the same data as the middle panel but cumulatively. This allows you to answer the question: what is the probability that we have seen all possible observations of the target behavior after n trials?

Figure 2.2 Collected data and the resulting binomial distributions. "A", "B", and "C" labels correspond to the in-text example. BONUS EXTRA CREDIT: Why would it be inappropriate to connect the data markers for these data?

The middle row of Fig. 2.2 shows what happens when we change the total number of observations (n) but the probability that a "yes" is observed remains the same. There are two observations to note here. First, the peak of the probability distribution remains at the same "percentage" of total observations. That is, whether the frequency with which we observed the target behavior was 7 out of 10 or 14 out of 20, the peak is located at the same relative proportion of the total n on the x-axis (e.g., 70% of n for our example). This makes sense because the probability that a "yes" was observed remained unchanged. Second, the height of the peak gets smaller throughout. This also makes sense. As the total number of observations increases, the probability we would observe exactly n target responses should decrease because there are many more values our observation can take.

The bottom row of Fig. 2.2 shows what happens when we change the total number of observations (n) and the probability that a "yes" is observed (p). As with above, increasing the total number of observations decreases the likelihood that any specific value is observed (i.e., the overall height of the curve is reduced). And, because the probability decreased that a "yes, our target behavior occurred" is observed, the probability distribution has shifted to the left because we are likely to have observed all instances of the behavior sooner, rather than later.

One of the main characteristics of distributions is that different distributions are used depending on the data type we have AND the question we want to ask. The binomial distribution is great when our data are discrete, nominal, can take one of two values, and we're interested in asking, "What is the probability we'll observe our target behavior on n trials?" or, "After n trials, what is the probability we have observed all instances of the target behavior?"
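The probabilities at points A, B, and C can be reproduced in a couple of lines; a minimal Python sketch, assuming SciPy is available (the chapter's supplemental Excel document covers these distributions as well):

```python
from scipy.stats import binom

n, p = 10, 0.7  # 10 learning opportunities; historically 70% followed by aggression

print(binom.pmf(5, n, p))   # ~0.103, point A: aggression on exactly 5 trials
print(binom.pmf(7, n, p))   # ~0.267, point B: aggression on exactly 7 trials
print(binom.pmf(10, n, p))  # ~0.028, point C: aggression on all 10 trials
print(binom.cdf(7, n, p))   # ~0.617: aggression on 7 or fewer of the 10 trials
```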
Geometric distribution
Geometric distributions are great for asking a different question. Sometimes we might want to know the likelihood of when the first occurrence of the target behavior will occur. Knowing how many trials until the first instance of a target behavior is likely to occur might be extremely useful in planning the structure and sequence of intervention sessions. For example, the black bars in the second row of Table 2.3 show the probability mass function and cumulative probability distribution for the geometric distribution wherein the target behavior occurs on 30% of trials (which is why the distributions start at 0.3 on trial 1). With each successive trial, it becomes less and less likely that the first occurrence will land on exactly that trial, which is why the plotted probability decreases. And, imagine if you knew that after 5 trials of a specific program, the probability that the target behavior would have been observed rises above 75%. The cumulative probability distribution gives you that. This information might be useful to share with an RBT who would then know to change programs after 3-4 trials if they desired to maintain the probability of observing the target behavior below a specific level. Geometric distributions do this for us.

Negative binomial distribution
Geometric distributions are technically a specialized instance of negative binomial distributions. Whereas geometric distributions ask, "What is the probability associated with n trials until the first instance of the target behavior occurs?," negative binomial distributions ask, "What is the probability associated with n trials until y instances of the target behavior occur?" The third row in Table 2.3 shows what this looks like when we are asking about five total instances of the target behavior and there is a 0.30 chance the target behavior will occur on any one trial. If you want to get a better sense of how things change with the probability and the number of instances of the target behavior that need to be observed, we recommend you check out the Excel document available here: https://github.com/david-j-cox/SupplMat-Stats-ABA/.
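For readers working in Python rather than the Excel supplement, here is a companion sketch of ours for the two questions above. Note that scipy's geom starts counting at trial 1, and its nbinom is parameterized by the number of "misses" before the target count, matching the k in the supplemental equations at the end of this chapter.

```python
from scipy.stats import geom, nbinom

p = 0.30  # chance the target behavior occurs on any one trial

# Geometric: P(first occurrence lands on exactly trial k)
print(geom.pmf(1, p))    # 0.30 (why the distribution starts at 0.3 on trial 1)
print(geom.pmf(2, p))    # 0.21
print(geom.cdf(4, p))    # ~0.76: by trial 4 the first occurrence has probably
                         # already happened

# Negative binomial: P(k trials without the behavior before the 5th instance),
# i.e., the book's question about five total instances at p = 0.30
r = 5
for k in range(3):
    print(k, nbinom.pmf(k, r, p))
```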
Continuous data distributions

Normal distribution
Perhaps the distribution most famous to behavior analysts and lay people alike is the normal distribution (aka bell curve, aka Gaussian distribution). The normal distribution is likely popular because it is
Table 2.3 Data distributions and the questions they might be used to answer. Question: Why are the discrete and continuous data types displayed the way they are? (The two "Visual" columns, showing each probability distribution and its cumulative form, are figures in the original.)

Distribution        Data type                               Question designed to answer
Binomial            Discrete, two possible values           What is the probability we will observe the target behavior on n of the trials?
Geometric           Discrete, two or more possible values   What is the probability we will go for n trials before we observe the first instance of a target behavior?
Negative binomial   Discrete, two or more possible values   What is the probability we will go for n trials before we observe y instances of the target behavior?
Normal (Gaussian)   Continuous, symmetric                   What is the probability we'll observe behavior at time n (or based on some other continuous variable)?
Lognormal           Continuous, asymmetric                  What is the probability we'll observe behavior at time n (or based on some other positive real continuous variable)?
Poisson             Continuous, asymmetric                  What is the probability of observing n target behaviors within an interval?
Exponential         Continuous, asymmetric                  What is the probability of observing n time between events?
particularly useful for at least three reasons. First, the normal distribution is popular because many different phenomena across the sciences follow the normal distribution. Here are a few examples of phenomena described well by the normal distribution: in physics, the probability a given particle in a bottle of gas has a particular velocity; in astronomy, the width of spectral lines; in chemistry, the weight of a single sugarcoated chocolate from a box of candies; in biology, the height and weight of members of a species; and, in behavior analysis, rate of responding at steady state. A second reason the normal distribution is popular is because of its statistical properties. The normal distribution is defined by two parameters⁵: the arithmetic mean (commonly referred to as the average value) and the standard deviation. The black line in the fourth row in Table 2.3 shows what the normal distribution looks like with a mean of 10 and a standard deviation of 2.5. If your data follow the normal distribution, then—without even doing any calculations—you know that approximately 68% of your observations will fall between one standard deviation less than the mean and one standard deviation greater than the mean; approximately 95% of your observations will fall between −2 and +2 standard deviations from the mean; and approximately 99.7% of your observations will fall between −3 and +3 standard deviations from the mean. Thus, once a scientist can estimate the mean and standard deviation of something they are interested in, they can easily predict and communicate with others the range of values that 68%, 95%, or 99.7% of future observations are likely to take. A third reason the normal distribution is popular is because of the central limit theorem. The central limit theorem is a neat little finding that—no matter what the distribution of the underlying data—if we collect a bunch of average values from a bunch of samples, the distribution of those averages will be approximately normally distributed. As we will see throughout the later chapters, this becomes extremely useful for describing and predicting the values that future observations of natural phenomena take, behavior or otherwise. It also forms the foundation for many different statistical tests.
⁵ We'll get more into parameters later. For now, you can just think of them as a fancy name for a number that we calculate to help us describe our dataset.
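Those coverage percentages are easy to verify empirically. Here is a minimal sketch of ours, assuming scipy and using the mean of 10 and standard deviation of 2.5 from the Table 2.3 example.

```python
from scipy.stats import norm

mu, sigma = 10, 2.5   # the parameters used for Table 2.3's normal example

dist = norm(loc=mu, scale=sigma)
for z in (1, 2, 3):
    coverage = dist.cdf(mu + z * sigma) - dist.cdf(mu - z * sigma)
    print(f"within +/-{z} SD: {coverage:.3f}")
# -> 0.683, 0.954, 0.997: the 68/95/99.7 rule from the text
```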
Lognormal
Though the normal distribution is certainly useful, not all continuous data that we collect are perfectly normal. Rather, sometimes the data we collect are skewed in one direction or another. By skewed, we just mean that the distribution is not symmetrical around the mean value: if you were to "fold the distribution in half," the two sides would not line up. When the bulkier "hill" sits more to the left (with the long tail pointing right), we call this positively skewed; when the bulkier "hill" falls more to the right (long tail pointing left), we call this negatively skewed. Further, the normal distribution technically extends forever in the positive and negative directions. However, sometimes the data we collect can never be negative. For one example of data that meet all these criteria, perhaps we usually observe target responding on only a few trials but occasionally observe it on many trials (this is the skewed bit). And, because we can never have negative rates of responding, the normal distribution wouldn't work because it technically assigns probability to both positive and negative response rates. All of these criteria are handled nicely by the lognormal distribution, in which all values are positive. And, if the lognormal distribution fits your data well, then you can take the natural log of all the values and the resulting data will be normally distributed, allowing you to take advantage of all the wonderful things that come with normal distributions. The fifth row in Table 2.3 shows lognormal distributions with varying means and standard deviations.

Poisson distribution
Poisson distributions are also useful for describing and predicting asymmetric data that can never be negative. Poisson distributions are great for answering the question, "What is the probability we'll observe n responses within some time interval?" The sixth row of Table 2.3 shows what these distributions look like for varying average numbers of responses observed within an interval. And, if you're wondering how counts of responses snuck into our continuous section, note that the Poisson distribution is technically discrete (it assigns probabilities to whole-number counts), but those counts accumulate over a continuous variable: time.

Exponential distribution
To round out our list of common distributions, we'll stick with our interest in time. However, rather than being interested in the number of responses we might observe within some time interval, the
exponential distribution is useful for plotting the probability that a certain amount of time will pass between occurrences of a target behavior (i.e., interresponse time) or some kind of environmental stimulus. The seventh row of Table 2.3 shows us what exponential distributions look like. As can be seen in the plots, these distributions continuously decrease as the time between events increases. This makes sense because we would expect long intervals between events to be less likely than shorter ones. Here, each line represents a different average rate of events and, therefore, a different distribution of the times that pass between them.
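To round out the continuous distributions, here is one more sketch of ours (the parameter values are arbitrary, chosen only for illustration) showing the lognormal log-transform trick, a Poisson probability, and the ever-decreasing exponential.

```python
import numpy as np
from scipy.stats import lognorm, poisson, expon

# Lognormal: all values positive; log-transforming recovers normality.
mu, sigma = 1.0, 0.5                      # parameters on the log scale
samples = lognorm(s=sigma, scale=np.exp(mu)).rvs(size=10_000, random_state=0)
print(samples.min() > 0)                  # True: never negative
print(np.log(samples).mean(), np.log(samples).std())  # ~1.0 and ~0.5

# Poisson: P(n responses within an interval), given an average per interval.
lam = 4                                   # e.g., ~4 responses per session
print(poisson.pmf(4, lam))                # probability of exactly 4 -> ~0.195

# Exponential: density of a given time between responses at rate lam per min.
print(expon.pdf(0.5, scale=1 / lam))      # shorter gaps are more likely...
print(expon.pdf(2.0, scale=1 / lam))      # ...than longer ones
```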
Quick recap and resituating ourselves
We have covered a lot of information in this chapter. To recap, data can come in many different types. The big ones that behavior analysts likely handle regularly can be separated into two categories: discrete and continuous. Discrete data are data that can take on only whole values. Discrete data can be broken into three smaller categories. Nominal discrete data provide no quantitative information in their raw format and might be thought of as simple category labels (e.g., colors, nationality, state of residence). Ordinal discrete data have a natural order to them such that one can be said to be larger or smaller than another, but the degree or size of difference between each discrete value is unknown (e.g., education level, income category). Quantitative discrete data also have a natural order and we know the relative degree or size of difference between each discrete value (e.g., number of students in a class, clients on caseload, exact annual income). Continuous data are data that can take on any real number value (can be a decimal) between negative infinity (or zero) and positive infinity, have a natural order, and provide a magnitude or degree of difference between values (e.g., temperature, height, average responses per minute). Continuous data are the most precise type of data and, therefore, behavior analysts often try to convert their data into continuous values if at all possible. We also discussed data distributions. Much of operant and respondent behavior is—by definition—probabilistic. That is, responding does not happen every single time someone contacts an evoking or eliciting stimulus. Relatedly, biological organisms do not respond in exactly the same way every single time. There is always some degree of variability in the rate, duration, latency, topography, or force of
responding. Thus, if we are interested in describing and predicting when and how much responding will occur, we need to understand how best to characterize responding based on the data we have (i.e., the data type; the shape of the data collected via comparison to probability distributions) and the question we are trying to ask (i.e., the probability distributions). Using the wrong probability distribution to characterize your descriptions and predictions about responding is akin to measuring the wrong thing during a functional analysis. You might find some interesting patterns in the data, but it is unlikely to be telling you what you think it is. To help make practical sense of all this, Fig. 2.3 shows a simple decision tree to help you identify the likely distribution of your data based on the broad data types you have collected and the question you might be interested in asking. In addition to following the decision tree in Fig. 2.3, you can also simply plot a histogram of your data, take a look at its shape, and compare it to the distributions in Table 2.3. One quick and very important note is that we covered only seven different probability distributions. For relative reference, Wikipedia lists 30+
Figure 2.3 Decision tree for identifying the statistical distribution likely to describe your data. Please check the online version to view the color image of the figure.
discrete distributions, 110+ continuous distributions, two mixed discrete/continuous distributions, 15+ joint distributions, one nonnumeric distribution, and five miscellaneous distributions. That means that Fig. 2.3 is necessarily incomplete. We chose the distributions likely to be used by behavior analysts based on the common data we collect. But, this roller coaster has many more loops and the rabbit hole is deep for those who love the thrill of numbers. To return to the larger picture, a fair question is: Why should behavior analysts care about data types and probability distributions? Well, there are at least two reasons that follow from the main purposes for which scientists and practitioners seem to use statistics today. The first reason we should care about data types and data distributions is to accurately describe and talk about the data we have collected (i.e., descriptive statistics; modeling one data type). Behavior analysts already know the importance of accurately turning observations of behavior-environment relations into numbers so they can do something with them (e.g., calculate response rate, plot data visually). But, not all numbers are created equal. If we are going to describe our data, we need to do it accurately. People can't wear 1.5 socks per day, there cannot be an average favorite color of students in a classroom, and the percentage of correct responses out of 10 trials is a discrete data type even though it might be presented in decimal format. Accurate use of our data matters as much as accurate collection of our data. In Chapters 3 and 4, we'll discuss how the data type and distribution influence how we describe the central tendency and the variability of our data. The second common use of our data is to make some kind of inference about the question for which we collected data in the first place (i.e., inferential statistics; modeling multiple data types). Did our intervention actually change behavior? Do the functional analysis results really show behavior is attention and escape maintained? How much of a change in performance on norm-referenced and criterion-referenced tests can be attributed to the services we delivered compared to if those services were not delivered? Do one behavior analyst's clients show more overall progress than another behavior analyst's, or do the differences among clients on their caseloads play a role? Do the clients receiving services at one clinic differ in important ways from the clients receiving services at different clinics? Based on the skill sets of my employees, how many hours of intervention are truly necessary for a client based on their clinical profile?
In each situation in the preceding paragraph, we shifted from describing one data type to comparing many data types. As a roadmap, Chapter 5 discusses how data types and distributions relate to intervention effect sizes, statistical significance, and social significance; Chapter 6 discusses how data types and distributions allow us to quantify the influence of many variables (experimentally controlled or not) on behavior; and, Chapter 7 talks about the exciting issue of how many observations we should probably make before we start opening our mouths and making claims about our data. Perhaps most excitingly, if you have gotten to this point, then we can leave the requisite boring stuff behind and get to the really exciting stuff around the practical use of statistics in ABA.
Supplemental: Probability distribution equations

Binomial probability distribution
$P_x = \binom{n}{x} p^x q^{n-x}$  (2.1)
• $P_x$ is the binomial probability that we would observe the target behavior on x of the trials.
• $\binom{n}{x}$ is the number of possible arrangements that the sequence of x trials might take. This is calculated using the factorial equation n!/(x!(n−x)!), where n is the total number of trials and x is the number of trials with the target behavior. Many online calculators exist for readers who are not interested in hand-calculating this in Excel.
• p is the probability the target behavior is emitted on each trial.
• q is the probability of not observing the target behavior on a single trial (i.e., 1 − p).
• n is the total number of trials.
The shape of the binomial distribution depends on the variables p and n, making these the primary parameters that define this distribution.

Geometric probability distribution

$P_x = (1-p)^{x-1}\, p$  (2.2)
• $P_x$ is the probability that we would go x trials before we observe the first instance of the target behavior.
• p is the probability the target behavior is emitted on each trial.
The shape of the geometric distribution depends on the variable p, making it the primary parameter that defines this distribution.

Negative binomial probability distribution

$P_x = \frac{(k + r - 1)!}{(r-1)!\,k!}\,(1-p)^k p^r$  (2.3)
• $P_x$ is the probability that we would go x trials before we observe r instances of the target behavior.
• k is the number of failures (i.e., trials without the target behavior).
• r is the number of instances of the target behavior we are waiting to observe.
• p is the probability the target behavior is emitted on each trial.
The shape of the negative binomial distribution depends on the variables p and r, making them the primary parameters that define this distribution.

Normal probability distribution

$P_x = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$  (2.4)
• $P_x$ is the probability that we will observe the behavior at time x (or based on some other continuous variable).
• μ is the arithmetic mean of our data.
• σ is the standard deviation of our data, calculated as $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$.
• N is the number of observations in the population (changes to n − 1 with samples).
• π is pi ≈ 3.14159.
• e is Euler's number ≈ 2.71828.
The shape of the normal distribution depends on the variables μ and σ, making them the primary parameters that define this distribution.

Lognormal probability distribution

$P_x = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln(x) - \mu)^2}{2\sigma^2}}$  (2.5)
• $P_x$ is the probability that we will observe the behavior at time x (or based on some other positive real continuous variable).
• μ is the arithmetic mean of the natural log of our data.
• σ is the standard deviation of the natural log of our data.
• ln is the natural logarithm.
• π is pi ≈ 3.14159.
• e is Euler's number ≈ 2.71828.
The shape of the lognormal distribution depends on the variables μ and σ, making them the primary parameters that define this distribution.

Poisson probability distribution

$P_x = \frac{e^{-\lambda} \lambda^x}{x!}$  (2.6)
• $P_x$ is the probability that we will observe x instances of behavior during an interval.
• λ is the average number of instances of behavior per interval.
• e is Euler's number ≈ 2.71828.
The shape of the Poisson distribution depends on the variable λ, making it the primary parameter that defines this distribution.

Exponential probability distribution

$P_x = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$  (2.7)
• $P_x$ is the probability that we will observe x time between events.
• λ is the average rate of responding.
• e is Euler's number ≈ 2.71828.
The shape of the exponential distribution depends on the variable λ, making it the primary parameter that defines this distribution.
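(A sketch of ours, not part of the original supplement.) If you'd like to convince yourself the equations above are written down correctly, the following Python snippet implements each one by hand and checks it against the matching scipy.stats distribution; every test value is arbitrary.

```python
import math
from scipy.stats import binom, geom, nbinom, norm, lognorm, poisson, expon

# Eq. (2.1): binomial
n, x, p = 10, 7, 0.7
q = 1 - p
assert math.isclose(math.comb(n, x) * p**x * q**(n - x), binom.pmf(x, n, p))

# Eq. (2.2): geometric
assert math.isclose((1 - p)**(x - 1) * p, geom.pmf(x, p))

# Eq. (2.3): negative binomial (k failures before the r-th occurrence)
k, r, p2 = 4, 5, 0.3
lhs = (math.factorial(k + r - 1) / (math.factorial(r - 1) * math.factorial(k))
       * (1 - p2)**k * p2**r)
assert math.isclose(lhs, nbinom.pmf(k, r, p2))

# Eq. (2.4): normal
xv, mu, sigma = 12.0, 10.0, 2.5
lhs = math.exp(-0.5 * ((xv - mu) / sigma)**2) / (sigma * math.sqrt(2 * math.pi))
assert math.isclose(lhs, norm.pdf(xv, mu, sigma))

# Eq. (2.5): lognormal (mu and sigma live on the log scale)
mu_l, sigma_l = 1.0, 0.5
lhs = (math.exp(-(math.log(xv) - mu_l)**2 / (2 * sigma_l**2))
       / (xv * sigma_l * math.sqrt(2 * math.pi)))
assert math.isclose(lhs, lognorm.pdf(xv, s=sigma_l, scale=math.exp(mu_l)))

# Eq. (2.6): Poisson
lam, count = 4.0, 6
assert math.isclose(math.exp(-lam) * lam**count / math.factorial(count),
                    poisson.pmf(count, lam))

# Eq. (2.7): exponential (scipy parameterizes by scale = 1/lambda)
t = 1.5
assert math.isclose(lam * math.exp(-lam * t), expon.pdf(t, scale=1 / lam))

print("All seven equations match scipy.stats")
```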
CHAPTER 3
How can we describe our data with numbers? Central tendency and point estimates

Why did the student get upset when his teacher called him average? It was a 'mean' thing to say.
Introduction
Welcome back to another exciting edition of "Fun with numbers in ABA! A data love story!" We're your hosts, Jason and David. In the first two chapters, we covered some of the less hands-on topics at the intersection of statistics and behavior analysis. But, it was for good reason. Without a solid understanding of the topics covered, you might talk about and play with your data in inappropriate ways. For example, you might snobbishly claim that statistics have no place in behavior analysis, talk casually about the average number of sock colors that people wear, or argue furiously that the count out of five trials with correct responses is a continuous number. If the examples in that last sentence sound like something you might do, you may want to go back and reread Chapters 1 and 2 (or read them for the first time if you skipped over them). However, if all of this is boring repetition and you feel pretty good with the first two chapters, feel free to skip ahead to the section titled, "What is next on the agenda?" Otherwise, here is a brief recap of where we are on our beautiful map of numberland. In Chapter 1, we defined statistics as a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data. Behavior analysts rarely use data in their raw format. Often we aggregate the data in some way (e.g., count up the number of times a response occurs), combine it with other data to present it visually (e.g., session number; intervention condition), and then analyze or interpret the resulting visual stimuli relative to operant and respondent models of behavior (e.g., three/four-term contingency). Yes, indeed, responses per minute, percentage of trials with correct responses,
and even the cumulative record are all—by definition—statistics. Many behavior analysts have inaccurate stimulus associations with the word “statistics” and so we also reviewed common myths and misconceptions about statistics. In Chapter 2, we covered the rest of the requisite boring stuff. Here, we met all sorts of exotic data types falling under the umbrella labels of discrete and continuous. Nominal discrete data provide no quantitative information (aka category labels). Ordinal discrete data have a natural order but no size difference. Quantitative discrete data have a natural order and a size difference. And, continuous data (the most analytically useful) can take on any real number value between negative and positive infinity. In Chapter 2, we also met a handful of common probability distributions that are useful for predicting response probability. Different probability distributions are more or less useful depending on the type of data we have, the range of values it can take, and the general shape of our data (i.e., how many do we have of each different value in our data). And, relating back to the examples on socks and count of trials with correct responses, some distributions are logically impossible for certain data types.
What is next on the agenda?
In this chapter, we begin our pivot toward more practical, everyday use of statistics by behavior analysts. We begin by focusing on one of the most common topics when describing natural events using numbers—point estimates and central tendency. Point estimates can be defined as a single number that represents an aggregation of data (e.g., Dodge, 2010). Arguably, the most common types of point estimates are those of central tendency—a single number that represents the central position of the aggregated data (Dodge, 2010). Stated differently, if you had to pick one number to represent all your data, it seems logical to pick the one most likely to occur—central tendency. To cover this topic, we'll start by reviewing why point estimates and central tendency are useful, how any dataset allows you to derive these at differing levels, and how determining the "right" level depends on your data and use case. After that, we'll review the most common point estimates of central tendency found in the wild and how they are calculated. Lastly, we'll try to make this all practically tractable by
reviewing common use cases where behavior analysts estimate central tendency and how simply doing what others do can lead you into numerically illogical hot water.
High-level overview of the why and the how
Only in rare circumstances are the raw collected data likely to be of practical help to behavior analysts. Typically, behavior analysts aggregate the data in one way or another to better understand a client's or research participant's behavior over time. For example, Fig. 3.1 shows the exact same data across all three panels. In the top panel, each response is plotted, response-by-response, across sessions for a hypothetical client. The middle panel shows the exact same data, but aggregated at the session level. And, the bottom panel shows the exact same data but aggregated by condition (i.e., baseline or intervention). Unless you are particularly adept at reading bar codes and transforming data between different types of visualizations, these different levels of summarization will likely exert different influences on you as a viewer. Importantly, because different visualizations have different effects on the viewer, and because we use data to answer all sorts of different questions, each has merits and drawbacks that should be considered when selecting which one to employ. Behavior analysts have likely interacted less often with plots like the top and bottom panels. In contrast, the plot in the middle aggregates data per session and is likely contacted more regularly by behavior analysts in their everyday lives. The downside to aggregating data to create a single numerical representation of multiple observations is that we lose information. For example, using the top panel in Fig. 3.1, a behavior analyst knows exactly when during each session each response occurred. In principle, the x-axis could also be expanded and the behavior analyst could count the number of responses per session—thus providing identical information to the right panel in the bottom plot. In contrast, using the bottom right barplot in Fig. 3.1, a behavior analyst would be unable to tell you exactly when during each session each response occurred. Did responding occur all at the beginning of the session, all at the end, was responding interspersed evenly throughout, or did it occur in three bursts at random points during the session? Who knows. All we know is the overall number that occurred per condition.
Figure 3.1 Demonstration of different levels for aggregating responding. The top panel shows the data without aggregation by displaying when each response occurred, minute-by-minute, for 30-min sessions, and for 10 sessions. The middle panel shows the same data but aggregated at the session level. The bottom panel shows the same data but aggregated at the condition level (baseline or intervention) as well as via two different methods for aggregating.
The loss of information through data aggregation is not inherently problematic. Fig. 3.1 highlights this well. With too much information (top panel), it may be challenging to make sense of the data we have collected and to use those data practically to help our client. With too little information (bottom panel), we miss the downward trend in the data during the intervention condition (if aggregating by average responding per session) and we miss out on the intervention effect (if aggregating by total responding in a condition). However, by aggregating responses at the total count per session level, the middle panel offers the right practical balance of information the behavior analyst can act upon. So what makes a good summary statistic? How do behavior analysts know the right level of aggregation to be practically useful? Unfortunately, there is no quick and ready answer to this question. Choosing the right level of aggregation depends on how easy it is to create the summary statistic, the questions the data are being used to answer, how the data are structured and distributed (Chapter 2), the extent to which the behavior analyst is trying to compare their dataset to previous datasets (e.g., peer-reviewed published research, previous graphs/data for the same or similar clients/responses), and who the audience is. Given these nuances, it's likely better for you to know generally the tools at your disposal and when they are and are not appropriate. So, without further ado, let's introduce you to the key numerical players.
Common descriptions of central tendency
As discussed throughout the book to this point, behavior analysts often use statistics to aggregate their data. Typically, this involves wanting to identify a single number that represents the central or middle value of a set of data. To this end, behavior analysts are likely familiar with three point estimates of central tendency: the arithmetic mean (commonly referred to as the average), the median, and the mode. These measures do not represent the totality of ways to quantify central tendency or the middle value. For example, means come in other flavors such as geometric, harmonic, and trimmed. Nevertheless, the arithmetic mean, median, and mode likely represent the most common measures behavior analysts will employ in their work and are the foundation of commonly used measures of behavior such as rate of responding and percentage.
Arithmetic mean
Readers will likely be most familiar with the arithmetic mean¹ as a measure of central tendency. The arithmetic mean is an appropriate measure of central tendency only when two conditions are met. First, the arithmetic mean is appropriate only when working with continuous data (see Fig. 3.3; also see discussion on the types of variables in Chapter 2). Second, the arithmetic mean is appropriate only when the variability in our data is symmetrically distributed (i.e., the data distribution is not skewed; Chapter 2).² If your data can only come in whole numbers (i.e., are discrete) or your data are not symmetric (i.e., not, say, normally or uniformly distributed), then the arithmetic mean is likely not the correct measure for you. Calculating the arithmetic mean is straightforward. First, you add or sum all values in your dataset together. Second, you divide the sum of the values by the number of observations. That's it! Easy peasy! The more formal ways to depict this calculation are shown in Eqs. (3.1) and (3.2). In both equations, x-bar ($\bar{x}$) represents the arithmetic mean, $x_i$ represents each value in the dataset, and n represents the number of observations.

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$  (3.1)

Eq. (3.1) also has that little ⋯ in the middle of it. That just means "do the same thing for all values between the second number and nth number." Like the rest of us, mathematicians will emit less response effort when possible. Thus Eq. (3.2) shows another way to write, literally, the exact same set of calculations. Here, we use the sigma (Σ) symbol in the numerator, which just means to "sum all values" of x in the dataset (i.e., the same thing as the numerator in Eq. (3.1)).

$\bar{x} = \frac{\sum x}{n}$  (3.2)
¹ Some authors use the terms average and arithmetic mean interchangeably. Yet, in some contexts, the term average is used to refer to any measure of central tendency. For clarity, we will use the term arithmetic mean and avoid the term average. Further, we specify arithmetic to differentiate the mean most frequently used from other types of means that can be calculated. Though a complete discussion of all types of means is beyond the scope of this text, we direct interested readers to Manikandan (2011) for a review of the many means for calculating means.
² Analyzing time series data adds one more assumption to this mix. We won't discuss that here to keep things simple. But we'll certainly bring it back up in Chapter 8.
Table 3.1 Social initiations.

Social initiations (no outlier)    Social initiations (outlier)
3                                  3
4                                  4
3                                  3
6                                  6
5                                  5
3                                  3
4                                  4
2                                  2
4                                  4
5                                  36
Later in this chapter, we will describe examples wherein behavior analysts use the arithmetic mean as a statistical representation of their data. Here, we'll simply note that the measures of rate of responding and latency to respond are oft-used measures where behavior analysts are calculating the arithmetic mean (though not always appropriately). As noted previously, the arithmetic mean is not appropriate when our data are not symmetric, such as when they are skewed or contain outliers. To illustrate why, consider a behavior analyst who is interested in the number of social initiations their client emits and so collects data over 10 observations (see left column, Table 3.1). Because talking about that array of data is cumbersome, they decide to describe their data using a single number and choose the arithmetic mean. Using the handy-dandy equations above, they sum up all the numbers (3 + 4 + 3 + 6 + 5 + 3 + 4 + 2 + 4 + 5 = 39), divide by the total number of observations (39/10), and find the arithmetic mean of social initiations per observation is 3.9.³ The astute reader will have also noticed that Table 3.1 contains a second column of identical values, except the last value. This second set of values is provided to illustrate when calculating and reporting the arithmetic mean is likely not wise. In situations in which a dataset contains an outlier—a value unusual or far away from the other values—the arithmetic mean is not appropriate. Outliers have the effect of leading to an arithmetic mean that is not particularly representative of the majority of values in a dataset. And the further the outlier is from the rest of the data, the bigger the unwanted influence.
³ We also wish to point out that it is best practice to report the arithmetic mean or any measure of central tendency in conjunction with some measure of variability—see Chapter 4 for how to do this right.
Returning to our example, if the behavior analyst collected data on social initiations for 10 observations but with the large outlier—right column, Table 3.1—the arithmetic mean balloons to 7.0 social initiations. This is problematic because the arithmetic mean is now larger than nine of the 10 values! In most cases, this would not be a very accurate description of this client's responding.
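For readers who would rather let a computer do the summing, here is a two-line sketch of ours (in Python, assuming numpy) using the Table 3.1 data.

```python
import numpy as np

no_outlier = [3, 4, 3, 6, 5, 3, 4, 2, 4, 5]     # left column of Table 3.1
with_outlier = [3, 4, 3, 6, 5, 3, 4, 2, 4, 36]  # right column (one unusual value)

print(np.mean(no_outlier))    # 3.9 -> a reasonable summary
print(np.mean(with_outlier))  # 7.0 -> larger than 9 of the 10 values
```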
Other types of means
As described above, the arithmetic mean is only appropriate when your data are continuous and symmetric (i.e., not skewed, no outliers). In the following, we discuss the median as another measure of central tendency often useful in these situations. But, to provide you with some exposure to the broader central tendency landscape, we'll review the conditions under which two sets of alternative means might more accurately describe the central tendency of your data. We won't go into the details on how to calculate these (as Google and Excel are your friends and make this easy). Instead we'll just highlight which alternative mean is likely best based on the data you have and the distribution they take, with a short sketch after this section for those who want to try them out. The first set of alternative means would be used when your data are not symmetric and the edges of your data do not contain too many outliers. That is, your data are skewed in some way but the edges of the distribution are still somewhat reasonable, such as with lognormal or Poisson distributions. In these instances, the arithmetic mean is likely to be biased due to the skew of the data (Fig. 3.2). For those who are still craving to use some kind of mean, the geometric mean
Figure 3.2 Use cases for measures of central tendency based on data type. "X" means likely inappropriate; "O" means likely appropriate; and "?" means you should probably think twice about whether it's appropriate or not. (The figure's panels are labeled "Symmetric Data," where the arithmetic mean, median, and mode coincide, and "Skewed Data," where the mode, median, and arithmetic mean fall at different points.)
and harmonic mean are two common alternatives. Geometric means are great when your data are exponential or lognormal in shape (see Chapter 2). Harmonic means are often used when an average rate of something per time is what you're trying to describe (response rate, anyone?). The second set of alternative means should be used when your data are roughly symmetric, but you have outliers above or below the rest such that your arithmetic mean is biased in some way. In these situations, you can use trimmed means. Trimmed means are exactly what they sound like. You "trim out" (i.e., drop) a percentage of your data on both the low and high ends of your data (you don't get to pick just one side). For example, if you calculated the 5% trimmed mean, you would remove the largest 5% and smallest 5% of values in your data. Then you would calculate the arithmetic mean of the remaining data. Using the right column in Table 3.1 as an example, with 10 observations we can only drop the top and bottom 10% of our data. So, out the door go the values of 2 and 36. The resulting arithmetic mean is 4.0—not bad! Though behavior analysts have described their data using the geometric mean (e.g., Fox, Smethells, & Reilly, 2013), harmonic mean (e.g., Shimp, 1969), and trimmed mean (e.g., Killeen, 2019), these alternative means are not common in practice. We include them here to expose behavior analysts to alternative ways to describe their data. Who knows, by exploring alternative ways to describe and communicate about the measures of behavior-environment relations you observe, you might just find a way to break from tradition that better meets the function of why you are collecting, aggregating, analyzing, and communicating about data in the first place!
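None of these require hand calculation either. A minimal sketch of ours, assuming scipy, applied to the right column of Table 3.1 (the geometric and harmonic values are shown purely for illustration; these particular data are neither rates nor lognormal).

```python
from scipy.stats import gmean, hmean, trim_mean

with_outlier = [3, 4, 3, 6, 5, 3, 4, 2, 4, 36]   # right column of Table 3.1

# Trimmed mean: drop the lowest and highest 10% (here, the 2 and the 36),
# then take the arithmetic mean of what remains.
print(trim_mean(with_outlier, 0.10))   # 4.0 -> the value reported in the text

# Geometric and harmonic means, for completeness.
print(gmean(with_outlier))             # ~4.5
print(hmean(with_outlier))             # ~3.8
```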
Median
Flowing logically from our discussion of the influence of skewed distributions and outliers on the arithmetic mean, you may wonder, "What measure of central tendency should I employ?" Your answer is often the median. The median is the middle value from a dataset. Calculating the median is straightforward. For a dataset containing an odd number of values, you first rank the values in your dataset from smallest to largest. Second, you identify the value that is exactly in the middle and that number is the median value. For a dataset containing an even number of values, you first rank the values in your dataset from smallest to
Table 3.2 Social initiations.

Social initiations (no outlier): 2, 3, 3, 3, 4, 4, 4, 5, 5, 6
Social initiations (outlier):    2, 3, 3, 3, 4, 4, 4, 5, 6, 36
largest. Second, you identify the middle two values. Third, you calculate the (arithmetic) mean of the two middle values and that arithmetic average is the median value. Boom! Median made manageable! To demonstrate, let’s turn again to the behavior analyst interested in the number of social initiations emitted by their client. We can calculate the median for the dataset obtained from the 10 observations by first ranking the values from the smallest to the largest. In doing so, we get the string of values in the left column of Table 3.2. Thereafter, because it contains an even number of observations, we identify the middle two values (4, 4) and calculate the arithmetic mean of these two values. The resulting median is 4. Recall the arithmetic mean for this same set of values was 3.9. Thus the arithmetic mean and median are nearly identical and either is likely to be a justifiable representative measure of central tendency (though remember our discussion around discrete vs. continuous data in Chapter 2). Now let’s consider the nearly identical social initiation dataset but containing that pesky outlier. To get this median, we work through a similar process of ranking the values from the smallest to the largest (Table 3.2; right column), identifying the middle two values (4, 4), and calculating the arithmetic mean of these two values. The resulting median is, again, 4 (which is identical to the median for the dataset without the outlier)! In contrast, recall that the arithmetic mean for this same dataset was 7.0. Nice! In sum, this demonstrates how the presence of an outlier often has little-to-no influence on the median, but a pronounced influence on the arithmetic mean.
Mode
Sometimes both the median and the arithmetic mean are not ideal descriptions of our data. Fig. 3.2 shows two graphs that combine the examples above and the notion of response probabilities from Chapter 2. The left plot shows what we saw previously. When the data are roughly symmetric and there are no outliers, the arithmetic mean and the median are very similar. However, when our data are skewed
or we have outliers, the arithmetic mean is biased in the direction of the skew or the outliers. As we saw earlier, often the median does a pretty good job capturing the central tendency of our data. But, in some situations where the data have a lot of skew, even the median might not describe our data as well as we’d like. Enter: the mode. The mode is the most popular or commonly occurring value in a set of data, thus requiring no special equations. Instead, you just count how many times each unique value in your data shows up. The one with the most wins the title of mode. Unlike the arithmetic mean and the median, the mode can technically be used when working with datasets involving any type of variable. That said, the mode would likely not be informative when working with continuous variables in which highly accurate measurement occurs. This is because there is a decreased likelihood that you will obtain an identical value multiple times (e.g., latency or duration measured to the ten-thousandths) and all or most values will occur only once or twice.
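A quick sketch of ours showing the last two measures side by side, using the Table 3.2 values (Python's built-in statistics module handles ties in the mode gracefully).

```python
import numpy as np
from statistics import multimode

no_outlier = [2, 3, 3, 3, 4, 4, 4, 5, 5, 6]     # Table 3.2, left column
with_outlier = [2, 3, 3, 3, 4, 4, 4, 5, 6, 36]  # Table 3.2, right column

print(np.median(no_outlier))    # 4.0
print(np.median(with_outlier))  # 4.0 -> the outlier barely matters
print(np.mean(with_outlier))    # 7.0 -> while the arithmetic mean balloons

print(multimode(no_outlier))    # [3, 4] -> these data are bimodal
```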
Summary of measures of central tendency
Fig. 3.3 summarizes graphically what we have covered to this point in the chapter. To summarize with words, behavior analysts rarely analyze the raw data that they collect. Instead, they aggregate the data in some way so that they can then graph the data or talk about the effects of their intervention. When we aggregate data at any level beyond their raw form, we necessarily lose some of the detail around exactly what behavior looked like and the conditions under which it occurred. Thus, when choosing how to aggregate a set of data, we often want to choose the metric that most closely aligns with the rest of the data it is meant to represent. In statistical jargon, we want to choose a measure that best represents the central tendency of our data.
Figure 3.3 Demonstration of how well different types of means represent data of different distributions.
Commonly used measures of central tendency include the arithmetic mean, median, and mode (though other more exotic metrics exist such as the geometric, harmonic, and trimmed means!). The arithmetic mean is a great choice when our data are continuous and symmetric. Here, the single number we end up using differs from the actual data by about the same overall amount above and below our number. But, logically, it makes no sense to talk about the arithmetic mean for such information as the function maintaining behavior (discrete nominal) or therapist education level (discrete ordinal). Even talking about the arithmetic mean for the number of students in a classroom or responses per minute (both discrete quantitative) can become questionable depending on what we're doing because students and responses cannot come in partial amounts. When our data are discrete, there are outliers, or the data are significantly skewed, the median and mode become great choices. Here, we select a measure of central tendency depending on whether the variable of interest is nominal (mode is your friend), how many levels of our value we have (discrete ordinal; median or mode might work), and the overall skewness/presence of outliers (discrete quantitative; median or mode might work).
Examples of how reporting central tendency in applied behavior analysis can be fun
To this point, we have made the examples and descriptions about all of this pretty straightforward. Chalk that up to brilliant explanatory writing. ;-) Only kidding, of course. The reason is that we wanted to make sure you had solid ground beneath your feet before we yanked the rug out. The reality is that things get messy (er . . . fun!) when we aggregate data on behavior-environment relations in behavior analysis. This is because each method for aggregating the exact same data might be inappropriate based on the function of our analysis in one context but perfectly splendid in a different context. As with everything else, you should consider the function—the reason why—you are doing what you are doing with your data. Through the following examples, we hope to demonstrate how you can think critically about the data you collect and how you analyze it. Importantly, just because one method for graphing point estimates of central tendency has traditionally been used in behavior analysis doesn't mean that others are not useful (or better!). We strongly encourage you to take an exploratory approach to playing with your data. Just like with Skinner when he
first saw the cumulative record, you never know what you might find by thinking differently about how you look at your data.
Rate⁴ of responding
Perhaps the most familiar measure of central tendency to behavior analysts is rate of responding. Rate of responding involves combining two different data types. One is a count of the number of times behavior occurs (discrete quantitative data) such as the number of correct responses, mands, and social initiations. Simply counting and reporting on the number of times behavior occurs gives an incomplete picture of responding. For example, consider if Jason reported, "David emitted 5, 18, and 75 mands across our last three Zoom calls." Your initial response might be, "Wow, David's emitting more and more mands! Great job, David!"⁵ Now consider how your perspective changes if Jason also said those Zoom calls were 5, 15, and 60 minutes, respectively. Is David's rate of manding actually increasing? The abovementioned uncertainty is why we add that second data type in duration of the observation (continuous data type). In the above example, the duration was captured in minutes. But, duration might be captured in seconds, minutes, hours, and so on. Adding the duration of each observation leads to a more appropriate interpretation of the collected data by calculating a rate of responding. Most commonly, behavior analysts calculate rate of responding by dividing the measured count of responding by the duration of the observation window. Returning to our example, we can whip out our trusty Texas Instruments 30XS calculator in sapphire blue, plug in the data, let the calculator do its magic, and report that David emitted mands at a rate of 1.0, 1.2, and 1.25 per minute over those last three Zoom meetings. At this point, you may be thinking that everything in the above paragraph seems straightforward. What could be so dastardly about response rate as to justify the intro paragraph to this section? Well
⁴ A thorough handling of terminology issues is well beyond the scope of this chapter. For a more thorough treatment, we direct the reader to writings around the disagreement amongst behavior analysts about how to use the terms count, frequency, and rate. Some have argued that frequency should be used to denote rate to align with how frequency is often used in other natural sciences (see Johnston et al., 2020). Yet, others have recommended frequency be used to denote count to align with usage common in behavior analytic journals and textbooks, with rate being used to denote count divided by time (Carr, Nosik, & Luke, 2018).
⁵ All due to Jason's perfectly planned and executed behavior plan for him. We know, we know: David needs a lot of work.
consider what you would say if someone were to ask you to predict how many mands David will emit in the next 5 minutes. It wouldn't make much sense to say 6.25 mands (5 × 1.25) as he can only emit a whole number of mands. Similarly, if someone asked you to describe how often David emits mands, it also wouldn't make much sense to talk about partial responses. In short, calculating the arithmetic mean of responses per minute really isn't a logical description of behavior, nor is it useful in making predictions about behavior. Does this mean that all past practitioners and researchers who have published on the rate of responding are doing it wrong? Of course not! Transforming discrete quantitative data to continuous quantitative data is useful for at least two reasons. First, as discussed in Chapter 2, continuous quantitative data are the most flexible and analytically nuanced data type. As such, scientists will try to convert their data to this data type whenever possible. For the rate of responding, transforming our data from discrete to continuous is particularly useful when we want to analyze the effect of an intervention between baseline and intervention conditions. Converting counts to a continuous measure makes subtle differences between intervention and baseline easier to see. Again, note the function here though. Discrete quantitative data are converted to continuous quantitative data to analyze the effect of a contingency on responding—not to provide an accurate description or prediction about behavior. The second reason that converting our counts of behavior to rate of responding is useful is so we can compare apples to apples. As noted above, each observation lasted for varying durations of time. Without putting each count of behavior on the same "playing field," comparing different counts was not quite right. This likely feels obvious to you. But we'll state this observation more generally as it will become really important as we move to later chapters. In many situations, the units that our raw data come in are not logically the same. And, we can only begin to aggregate and analyze data once they have been converted to the same unit. With response rate, we are converting all counts to a standardized "count per minute." Below, we'll review this idea again around percentages. And, in later chapters, we'll get a bit more wild with it. To close out our conversation around the conditions under which response rate (as calculated above) may and may not be appropriate, note that nowhere did we check whether our data were symmetric.
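Before unpacking why that matters, here is the rate arithmetic from the Zoom-call example as a sketch of ours; the counts and durations are the hypothetical values from the text. Note, as a preview, that the pooled rate and the mean of the per-call rates disagree when durations are unequal.

```python
import numpy as np

mands = np.array([5, 18, 75])        # counts from the three Zoom calls
minutes = np.array([5, 15, 60])      # durations of those calls

print(mands / minutes)               # [1.0, 1.2, 1.25] mands per minute

# Pooled rate vs. mean of the per-call rates:
print(mands.sum() / minutes.sum())   # ~1.23 overall
print((mands / minutes).mean())      # 1.15 -> unequal durations matter
```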
Why does this matter? Well, technically, calculating the rate of responding as described above is equivalent to calculating the arithmetic mean of the number of responses per minute. Rate of responding is equivalent to answering the question, “on average, how many responses will we observe per minute?” As we saw above, the arithmetic mean is an accurate measure of central tendency only if your data are symmetric. If responding is not symmetrically distributed across all minutes of your observation window, the arithmetic mean might not be an accurate statistic to use. As Table 3.1 showed, perhaps one portion of your observation window skews the rate of responding away from patterns of behavior that are much more likely. Thus your plotted rate of responding may be unduly influenced by an outlier portion of your observation window. Some examples from the published literature might help to highlight how researchers use different measures of central tendency to report on occurrences of responding. As one example of researchers using the arithmetic mean, Landa et al. (2022) evaluated the influence of prompting a functional communication response (FCR) following problem behavior for participants who exhibited severe problem behavior. The researchers collected the number of occurrences of problem behavior and FCRs emitted by participants during each functional analysis and functional communication training session. They then aggregated their data (derived a statistic!) by dividing the count of each response by the session duration in minutes. These derived statistics were then put into graphical displays as the rate of problem behavior or FCRs per minute. In a second example of researchers using the arithmetic mean, Sloman et al. (2022) compared the influence of chained and multiple schedules on vocal stereotypy. Although not directly targeted, the researchers were also interested in the influence of their procedures on compliance with mastered tasks. As such, the researchers collected data on the number of occurrences of compliance during the observation periods. They then divided these counts by the total number of minutes in the observation window and statistically summarized responding as responses per minute. Lastly, they displayed these statistics graphically to be analyzed visually. Sometimes the median might be the more appropriate choice. One example that behavior analysts will be familiar with is the use of scales
to evaluate social validity. For example, Mery et al. (2022) sent participants a questionnaire containing 10 statements related to a training in which they participated, such as "I would be willing to use behavior skills training to learn additional skills." They asked the respondents to evaluate each statement on a 1-5 scale, where 1 = strongly disagree, 2 = somewhat disagree, 3 = neither agree nor disagree, 4 = somewhat agree, and 5 = strongly agree. Because individual responses as a whole are challenging to do much with, they chose to aggregate the results (derive a statistic!) for easier analysis. To do this, they calculated medians because the data were discrete ordinal data types. Lastly, one example of how the mode can be effectively employed by behavior analysts comes from a recently published review by Jennings, Vladescu, Miguel, Reeve, and Sidener (2022). As a component of this review, the authors presented publication trends (p. 95) for journals publishing intraverbal research. Their figure 2 shows a bar graph with the count of articles published on intraverbals for each journal. Here, it would make no sense to talk about the average journal or even the median journal because journals are discrete nominal data. Thus these researchers reported on the mode journal, which was the Journal of Applied Behavior Analysis.⁶
Percentage

Another common way behavior analysts aggregate data is to calculate the percentage of instances in which something of interest happens. That something of interest might be "correct responding," "agreement between two observers," "intervals with social initiations," or something else. Calculating percentages is fairly straightforward. You simply divide the count of the thing you are interested in by the total possible opportunities for the thing to have occurred. Continuing the previous examples, the "total possible opportunities" would be "trials where the client could have emitted the correct response," "the number of observations made by the observers," and "the total number of intervals wherein we observed behavior," respectively. Once this proportion of opportunities with the target behavior has been calculated,
you simply multiply that proportion by 100 to get the percentage. Proportions can range between 0.0 and 1.0 and percentages can range between 0% and 100%. Interestingly, and different from response rate, percentages do not give any information about the dimension of behavior. This is because dividing two numbers with identical units cancels them out, leaving us with just a number. Although this is not inherently problematic, we point this out so you realize that calculating and working with percentages removes us from any specific behavioral dimension (i.e., you lose information). So, to the extent that some dimension of behavior is important to the research or practice question you are asking, calculating percentages may not be the best statistic to use. Further, it is this lack of dimension that causes "percentage" to mean different things depending on context and the underlying data we are using. In short, not all percentages are equal.

So where do things get hairy with calculating percentages? Well, there are two use cases where your spidey senses should perk up and your critical thinking repertoire should be induced. First, sometimes percentage is similar to a special discrete quantitative data type called accuracy. We get deep into the weeds in Chapter 6 on accuracy along with eight alternative metrics for analyzing data in these contexts. But, briefly, and using interobserver agreement (IOA) as an example, calculating the "percentage of trials with agreement" adds up the number of instances in which observers agree that behavior occurred with the number of instances observers agree behavior did not occur. Then we divide the total trials observers agreed by the total number of trials and multiply the result by 100. Table 3.3 shows what is missing.

Table 3.3 Alternatives to calculating percentage using IOA as an example.

                                  OBSERVER #1
                                  Occurred (1)      Did Not Occur (0)
OBSERVER #2  Occurred (1)         Agree Occurred    Disagree
             Did Not Occur (0)    Disagree          Agree Did Not Occur

IOA, Interobserver agreement.
With percentage of agreement, we have no information about which of the two "Disagree" buckets those observations fell into. Were those disagreements balanced? Or did one observer tend to say the behavior occurred more than the other? If this information is important (e.g., Observer 1 is ground truth), then focusing only on the percentage of trials with agreement might obscure important information about how someone is performing and the type of errors they make.

The second instance where your spidey senses should go up is when people start aggregating percentages. For example, let's say you are reading an article where five different people collected data, each overlapped in their data collection with two to three other raters, IOA was calculated, and the authors report on the overall, "average" IOA. By this point in the chapter, hopefully, you appreciate what is missing. The arithmetic mean of "overall IOA" might be unrepresentative if one of the IOA values was significantly greater or lesser than the rest or the set of IOA values is skewed in some way. A giveaway that something like this might be going on is when you see a reported "average IOA" that is close to the minimum (or maximum) IOA value and much further away from the others.

But, enough about this generalizable, high-level stuff. We would venture to guess that many, if not all of you dear readers, have some familiarity with percentages as a component of your educational history. And, we'd further venture to guess that you have encountered percentages when skimming through the pages of the many applied behavior analytic journals. To illustrate how behavior analysts use percentage in different ways similar to different measures of central tendency, we'll present a few more examples from the published literature and how they do not always mean the same thing.

Our second example comes from the context of skill acquisition where behavior analysts commonly calculate and report the percentage of responses that are correct out of all possible trials. For example, Halbur, Kodak, Williams, Reidy, and Halbur (2021) conducted a comparison of discrimination training conditions with three participants with autism spectrum disorder. Their primary dependent variable was the percentage of responses that were independently correct, calculated by dividing the number of independent correct responses (count) in a session by the number of total responses (count) in that same session, and multiplying by 100. (Note, we specified count parenthetically to
highlight that the raw data reflected a behavioral dimension (count) but that behavioral dimension is lost through the calculation.) In this context, percentage is being calculated and used similarly to accuracy and comes with the caveats noted above.

A third common example comes from the context of staff training or treatment fidelity. Here, behavior analysts commonly calculate and report the percentage of steps an employee performed correctly or accurately. For example, Campanaro and Vladescu (2022) were interested in the influence of computer-based instruction on staff implementation of discrete trial instruction. To determine whether staff were getting better at implementing discrete trial instruction, the researchers calculated the percentage of discrete trial instruction steps the staff implemented correctly. This was accomplished by adding the number of steps implemented correctly, dividing this number by the total number of steps (there were 10), and multiplying by 100. This statistical measure of staff performance was then plotted for visual analysis of intervention effects. This is another example of where the researchers used percentage similarly to accuracy.

A fourth common example of when behavior analysts use percentage is when assessing client preference. Here, behavior analysts commonly will calculate and report the percentage of the total number of selection responses that were made to each stimulus in a preference assessment. For example, Basile, Tiger, and Lillie (2022) conducted a comparison of two approaches for conducting concurrent-chains preference assessments to evaluate relative consistency, correspondence, and efficiency. To report on the paired-stimulus preference assessments, the researchers calculated the percentage of times a stimulus had been selected by counting the number of times each stimulus was selected, dividing it by the total number of times the participant made a choice, and multiplying by 100. Note here that it doesn't make much sense to talk about percentages as similar to accuracy. In contrast, percentages are being used similarly to discrete nominal data such that mode is often the best measure of central tendency (i.e., Which stimulus was chosen more than the rest?).

As one final example, percentage of intervals where behavior is observed is another instance where behavior analysts use percentage in a manner different from accuracy. Here, the behavior analyst uses some type of time sampling method (e.g., partial interval, whole
interval, momentary time sampling) to collect data on whether or not behavior occurred during each observation interval. At the end of data collection, the behavior analyst then counts the number of intervals where behavior occurred, divides it by the total number of intervals where the behavior analyst was observing the client, and multiplies it by 100 to get the percentage of intervals with behavior occurring. As with the previous example, it doesn't make much sense to talk about accuracy in these instances. Rather, percentage here is being used as a discrete quantitative data type. Thus median or mode might be the correct measure of central tendency depending on how many intervals are possible and the shape of the data distribution.
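As a rough sketch of how these different uses of "percentage" come apart in code (all session data below are hypothetical), note how the same divide-and-multiply arithmetic answers three different questions:

```python
# 1) Percent correct (used like accuracy): correct trials / total trials.
correct, total_trials = 17, 20
pct_correct = correct / total_trials * 100             # 85.0

# 2) Percent of intervals with behavior (discrete quantitative data).
intervals = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]             # 1 = behavior observed
pct_intervals = sum(intervals) / len(intervals) * 100  # 60.0

# 3) Trial-by-trial IOA, keeping the 2x2 breakdown from Table 3.3.
obs1 = [1, 1, 0, 1, 0, 1, 1, 0]
obs2 = [1, 0, 0, 1, 0, 1, 1, 1]
agree = sum(a == b for a, b in zip(obs1, obs2))
pct_agreement = agree / len(obs1) * 100                # 75.0
# The two "Disagree" cells that percent agreement hides:
obs1_only = sum(a == 1 and b == 0 for a, b in zip(obs1, obs2))  # 1
obs2_only = sum(a == 0 and b == 1 for a, b in zip(obs1, obs2))  # 1
```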
Chapter summary

Behavior analysts collect a lot of data on many different things. Most of us do not have a wizard-like ability to analyze streaming raw data in order to make sense of the behavior-environment relations we observe. As a result, behavior analysts often aggregate the raw data (derive a statistic!) that was collected during a session into a single number. In statistical jargon, this is referred to as deriving a point estimate (single number) that captures the value of behavior someone is most likely to observe (central tendency). The three most common statistical point estimates of central tendency are the arithmetic mean, the median, and the mode. From a pure statistical standpoint, the most accurate statistical point estimate of central tendency depends on the data type and the relative distribution of the data we are trying to summarize. But, in case you may not have known, behavior analysis is not a discipline falling within pure statistics. This doesn't mean that we get to do whatever we want. As Tables 3.1 and 3.2 showed, the accuracy of how we describe any single variable still requires us to pay attention to the data type we have and how it is distributed (i.e., shape and presence of outliers). For example, it doesn't make sense to predict someone will engage in 3.2 responses over the next 5 minutes; it doesn't make sense to talk about the average toy chosen during a paired-stimulus preference assessment; and it doesn't make sense to make claims about the median favorite color of students in a classroom. Paying attention to your data type and the reason you are reporting on your data will help you to avoid these silly illogical uses of numbers.
In behavior analysis, we also will get a bit more wild in how we use the data we collect. Rather than using measures of central tendency for a single variable, we often aggregate data from two different variables into a single measure. The most common examples are rate of responding and percentage of [fill in the blank] where [target behavior] occurs. Rate of responding is often calculated like an arithmetic mean by summing up the total number of responses and dividing by the total number of minutes (or hours or sessions). To be an accurate description of behavior, calculating rate of responding this way requires that the individual observations come from a symmetric distribution. If they don't, then the median responses per minute or one of those exotic "other" types of means might be a more accurate measure of central tendency. Finally, rate of responding converts a discrete quantitative data type (number of responses) to a continuous data type. This is useful when the questions the data are being used to answer involve detecting potentially nuanced differences between baseline and intervention conditions (i.e., most published behavior analytic literature). However, when describing or predicting behavior, rate of responding may not be logical if the context suggests we should talk about responses, which can only come in whole numbers.

Percentage is an even more context-dependent measure. This is because percentages involve dividing one number by a second number where both numbers have the same units. This "cancels out the units," making percentage a unit-less measure. As we saw above, this has led behavior analysts to use percentage in a manner representative of discrete nominal data (e.g., percentage of all choices a stimulus is chosen during a preference assessment; mode is the best choice); discrete quantitative data (e.g., percentage of intervals that a target behavior occurred; median or mode are the best choices); or accuracy, a specific type of discrete quantitative data (e.g., IOA, percentage of trials with correct responding; see Chapter 6 for more on the nuanced nature of accuracy). As with the rate of responding above, the takeaway message here is that you should always be thinking critically about your data context. What is the question you are answering? And, are you trying to describe or predict responses which can only come in whole units?

To close this chapter, let's return to our beautiful map of numberland. We rarely use our data in its raw form. When we aggregate our data we are deriving a statistic that is meant to best capture the central
tendency of the many observations we made during that session. The measure we choose to use depends on our data type, the question we are trying to answer, and what is logically possible. Importantly, when we aggregate data at the session level (e.g., rate of responding, percentage), we should think critically about the data we have collected and the claims we are trying to make. Now, as you likely already knew, the measure of central tendency is not the whole story. Though any measure of central tendency will tell you the “most representative response” you might see, Chapter 2 and your experience in the field also demonstrate that all datasets have variability. In the next chapter, we get more into practical statistical measures of variability that you will likely employ. And, in Chapter 5 we get to combine Chapters 3 and 4 to talk about the size of impact that our interventions have on behavior. See you soon.
References

Basile, C. D., Tiger, J. H., & Lillie, M. A. (2022). Comparing paired-stimulus and multiple-stimulus concurrent-chains preference assessments: Consistency, correspondence, and efficiency. Journal of Applied Behavior Analysis, 54(4), 1488–1502. Available from https://doi.org/10.1002/jaba.856.

Campanaro, A. M., & Vladescu, J. C. (2022). Using computer-based instruction to teach implementation of discrete trial instruction: A replication and extension. Behavior Analysis in Practice. Advance online publication. Available from https://doi.org/10.1007/s40617-022-00731-7.

Carr, J. E., Nosik, M. R., & Luke, M. M. (2018). On the use of the term "frequency" in applied behavior analysis. Journal of Applied Behavior Analysis, 51(2), 436–439. Available from https://doi.org/10.1002/jaba.449.

Dodge, Y. (2010). The concise encyclopedia of statistics. Springer. ISBN: 0397518374.

Fox, A. T., Smethells, J. R., & Reilly, M. P. (2013). Flash rate discrimination in rats: Rate bisection and generalization peak shift. Journal of the Experimental Analysis of Behavior, 100(2), 211–221. Available from https://doi.org/10.1002/jeab.36.

Halbur, M., Kodak, T., Williams, X., Reidy, J., & Halbur, C. (2021). Comparison of sounds and words as sample stimuli for discrimination training. Journal of Applied Behavior Analysis, 54(3), 1126–1138. Available from https://doi.org/10.1002/jaba.830.

Jennings, A. M., Vladescu, J. C., Miguel, C. F., Reeve, K. F., & Sidener, T. M. (2022). A systematic review of empirical intraverbal research: 2015–2020. Behavioral Interventions, 37(1), 79–104. Available from https://doi.org/10.1002/bin.1815.

Johnston, J. M., Pennypacker, H. S., & Green, G. (2020). Strategies and tactics of behavioral research and practice (4th ed.). Routledge.

Killeen, P. R. (2019). Bidding for delayed rewards: Accumulation as delay discounting, delay discounting as regulation, demand functions as corollary. Journal of the Experimental Analysis of Behavior, 112(2), 111–127. Available from https://doi.org/10.1002/jeab.545.

Landa, R. K., Hanley, G. P., Gover, H. C., Rajaraman, A., & Ruppel, W. K. (2022). Understanding the effects of prompting immediately after problem behavior occurs during functional communication training. Journal of Applied Behavior Analysis, 55(1), 121–137. Available from https://doi.org/10.1002/jaba.889.
Manikandan, S. (2011). Measures of central tendency: The mean. Journal of Pharmacology & Pharmacotherapeutics, 2(2), 140–142. Available from https://doi.org/10.4103/0976-500X.81920.

Mery, J. N., Vladescu, J. C., Day-Watkins, J., Sidener, T. M., Reeve, K. F., & Schnell, L. K. (2022). Training medical students to teach safe infant sleeping environments using pyramidal behavioral skills training. Journal of Applied Behavior Analysis. Advance online publication. Available from https://doi.org/10.1002/jaba.942.

Shimp, C. P. (1969). The concurrent reinforcement of two interresponse times: The relative frequency of an interresponse time equals its relative harmonic length. Journal of the Experimental Analysis of Behavior, 12(3), 403–411. Available from https://doi.org/10.1901/jeab.1969.12-403.

Sloman, K. N., McGarry, K. M., Kishel, C., & Hawkins, A. (2022). A comparison of RIRD within chained and multiple schedules in the treatment of vocal stereotypy. Journal of Applied Behavior Analysis, 55(2), 584–602. Available from https://doi.org/10.1002/jaba.906.
Chapter 4 Just how stable is responding? Estimating variability

A statistician will stand with their head in an oven and their feet in a block of ice and tell you, on average, they feel fine.
Introduction

By this point in the book, we hope you have come to realize that statistics are all around you. Any time you collect data from more than one observation and describe those data with fewer numbers, you are aggregating data and using statistics (Chapter 1). Though this may sound straightforward, how you go about choosing the "best" number or set of numbers to describe your data is not always obvious. In Chapter 3, we reviewed some of the options for how we might describe many observations with a single number and some of the criteria for selecting among those options. In Chapter 3, we also reviewed the conditions under which the different types of central tendency point estimates are and are not appropriate. But using only the central tendency does not tell the entire story of our data and has several limitations by itself. One of those limitations is that a single numerical representation of the "typical" or "most common" observation does not tell you how variable your data are around that number. Were all the data really close to that single number? Or were they all quite far from that number? Or were there several clusters of observations at varying distances from that single point estimate of central tendency? To precisely and quantitatively describe this variability in their data, scientists and practitioners often will add one or more numbers to the point estimate of central tendency. These are referred to as descriptions of variability. Similar to point estimates of central tendency, there are many different ways that behavior analysts can go about describing the variability in their dataset. And, also like point estimates of central tendency, the option chosen depends on the function of the behavior analyst's communications: why are they talking about their data and with whom?
The purpose of this chapter is to help behavior analysts choose among the many options available to describe the variability of their data. To do this, we begin by discussing common ways to talk about the variability of the data in the dataset. After that, we discuss common ways to talk about the likely variability in the measure of central tendency we have chosen. Throughout, and because all descriptions of data are limited in one way or another, we review the conditions under which each description of variability might be useful and when it might be problematic. We also point out the assumptions behavior analysts must make when using different descriptions of variability, and we close by providing recommendations on when to use different descriptions of data variability based on the dataset you might be working with.
Describing the spread of your data

Perhaps the most common description of data variability used by behavior analysts is the spread of their data. Describing the spread of your data can be accomplished in at least four different ways (Fig. 4.1). These are (1) min and max values, (2) range, (3) interquartile range (IQR), and (4) standard deviation.
Min and max values

The top panels in Fig. 4.1 show the most common way to describe the spread in a dataset: simply stating the minimum and maximum observed values. As a sentence for the top left panel, that might look something like, "During [condition], the client emitted an arithmetic mean of 7.93 mands per minute (min = 0.00, max = 20.11)." Here the author would be stating that the minimum and maximum observed rates of mands per minute during baseline were 0.00 and 20.11, respectively. Note, here, that the min and max values are statistics even though they are individual data points within your dataset. This is because (1) they are aggregate descriptions of what happened in that single session, and (2) that simple description lets the reader know that the rest of your data, in aggregate, is higher than the min value and lower than the max value.
Range

Another way to describe the spread of your data is by describing the range. The range reflects a single number that describes the difference
Figure 4.1 Examples and limitations of using the spread to describe variability in data. IQR, interquartile range; Mean, arithmetic mean; STD, standard deviation.
between the maximum value in your dataset and the minimum value in your dataset (second row of panels in Fig. 4.1). Calculating the range is accomplished through a simple equation:

$\text{Range} = (\text{maximum value}) - (\text{minimum value})$ (4.1)

Using the right panel of the second row in Fig. 4.1 as an example, the range of the observed data for this hypothetical client was 25.49 mands per minute (30.84 − 5.35 = 25.49). A sentence describing this
might look something like, "During [condition], the client emitted an arithmetic mean of 12.06 mands per minute (range = 25.49)."
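A minimal sketch of Eq. (4.1) in Python (the mands-per-minute values are hypothetical):

```python
mands_per_minute = [5.35, 9.10, 8.68, 11.42, 12.63, 10.97, 30.84]  # hypothetical

minimum, maximum = min(mands_per_minute), max(mands_per_minute)
data_range = maximum - minimum  # Eq. (4.1): 30.84 - 5.35 = 25.49
```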
Interquartile range

A third method to describe the spread in your data is called the interquartile range (IQR; third row of panels, Fig. 4.1). In the last chapter, we talked about one measure of central tendency called the median (or middle) value. To recap, the median value is calculated by simply arranging all the data in order from highest to lowest values, and finding the number right in the middle at the 50% mark. The IQR uses this same idea but, rather than splitting our data in half at the 50% mark, we split our data into four buckets with the same number of observations in each. This allows us to identify the value of our data at the 25%, 50% (median), and 75% marks. By marking the 25%, 50%, and 75% points in our data, we now have four equal-sized bins (i.e., 0%–25%, 25%–50%, 50%–75%, 75%–100%), or in the language of math, we have created quartiles. While the range describes the difference between the maximum and minimum value (i.e., the 100% and 0% data values), the IQR describes the difference between the 75% and 25% values. Restated as another easy equation:

$\text{IQR} = (\text{75\% value}) - (\text{25\% value})$ (4.2)

The IQR describes how spread out the data are that fall right in the middle half. The IQR answers the question, "How spread out are the 50% of observations that fall in the middle?"
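Most scientific computing libraries will find the quartile values for you. A sketch using NumPy (hypothetical data; note that different libraries interpolate the 25% and 75% points slightly differently, so results can differ in the decimals):

```python
import numpy as np

data = np.array([5.35, 8.68, 9.10, 10.97, 11.42, 12.63, 14.20, 30.84])
q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25  # Eq. (4.2): the spread of the middle 50% of observations
```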
Standard deviation

A final common way to talk about the variability in our dataset as a whole is via the standard deviation (bottom row of panels in Fig. 4.1). Standard deviation answers the question, "On average,1 how far from the measure of central tendency is each datum?" Small standard deviations would indicate that the data have little variability and large standard deviations would indicate that the data have high variability. What counts as "small," "little," "large," and "high" here would be determined by the clinical/educational/research context, the behavior of interest, and what high or small variability means practically in the clinic/educational/research context.

1. Hopefully, after the last chapter your spidey senses start to perk up whenever you see the word "average." As a reminder, certain assumptions need to be met about your data for the average to be an appropriate measure of your data. The same holds here.
Calculating standard deviation starts to get into some of the fun logic behind numbers and how we use them to describe data. If you recall from Chapter 3, to calculate the average value of a dataset (e.g., arithmetic mean), you add all of your values together and then divide by how many numbers went into that sum. As words:

$\text{Average} = (\text{sum of all numbers}) / (\text{count of numbers})$ (4.3)

As an equation:

$\mu = \sum_{i}^{N} x_i / N$ (4.4)

The use of symbols and their positioning may be off-putting, but fear not, as Eqs. (4.3) and (4.4) are expressing the same thing using a different set of verbal stimuli. More specifically, the $\mu$ represents "average" (arithmetic mean2), and the $\Sigma(x)$ represents "add up all $x$s." The letters $i$ and $N$ represent which number to start adding ($i$) and which number to stop adding once you reach it ($N$). In most cases, $i$ is the first value in our dataset and $N$ is the last value in our dataset. Restated with the translation, "The average ($\mu$) number in my dataset is calculated ($=$) by adding up all the numbers ($\sum_{i}^{N} x_i$) and dividing by the count of how many numbers we have ($N$)."

Calculating standard deviation is the same idea, except we're calculating the average difference from the arithmetic mean rather than the average value in our dataset. Eq. (4.5) shows what exactly that looks like for calculating standard deviation3:

$\sigma = \sqrt{\dfrac{\sum_{i}^{N} (x_i - \mu)^2}{N}}$ (4.5)

More verbosely using words, we calculate the difference between each observation and the arithmetic mean $(x_i - \mu)$, we square each of those values $(x_i - \mu)^2$, add all of those up ($\Sigma$), divide by the total number of observations ($N$), and then take the square root of that final, single number.4 The result is a number that represents, on average (i.e., on arithmetic mean), how far away each observation is from the arithmetic mean of the entire dataset.

2. The same thing as x-bar in Chapter 3.

3. Savvy readers will likely note the equation for the standard deviation of a population. To calculate the standard deviation of a sample, simply make the denominator $N - 1$ instead of $N$. Whether or not you do this with your data will depend on whether you have observed all responses under some environmental condition (i.e., the population of responses) or you have observed only some of the responses under some environmental condition (i.e., a sample of responses).

4. Readers interested in the logic behind why we do this should see the Appendix to this chapter.
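Here is a small sketch showing Eq. (4.5) computed by hand and with NumPy's shortcut (the data are hypothetical):

```python
import numpy as np

data = np.array([4.0, 6.0, 5.0, 7.0, 3.0, 5.0])
mu = data.mean()

# Eq. (4.5), population form: divide by N.
sigma = np.sqrt(((data - mu) ** 2).sum() / len(data))

# np.std defaults to the same population form (ddof=0); per footnote 3,
# pass ddof=1 to divide by N - 1 for the sample standard deviation.
sigma_sample = data.std(ddof=1)
```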
Benefits and drawbacks of each description of spread

Each method of describing the spread of your data can be useful depending on what you are trying to communicate to your audience. Explicitly stating the minimum and maximum values of your data is useful when you want the audience to know the least or most amount of behavior they might expect to observe if they were working with that individual. Stated differently, if a behavior analyst were to observe a response rate anywhere between the minimum and maximum observed values, then they likely have set up the environment-behavior conditions in a functionally similar way to the conditions that led to the original dataset. If a behavior analyst observed response rates below the minimum or greater than the maximum, then this might suggest that different environment-behavior relations have been created and should be investigated further. Explicitly stating the range of your data also can be useful as it lets the audience know the degree of variation they might expect within a condition from session to session so that they can be prepared to respond accordingly.

Reporting the minimum and maximum values or range as descriptors of variability can be misleading in some contexts (top panels in Fig. 4.1). First, these indices of variability are highly sensitive to outliers in your dataset (right panel, second row, Fig. 4.1). If you recall from Chapter 3, outliers can be defined as measurements recorded that are so different from the rest of your dataset that an explicit decision must be made as to how to handle them. We will talk more about handling outliers later in the book. But, for our purposes here, the point is that reporting variability using min and max or range would be misleading with the data in the right panel of the second row in Fig. 4.1. Here, the description would be, "Rates of responding during this condition averaged 12.06 (range, 5.35–30.84)." But, as a reader with access to the data, it's easy to see that the max value stated really is not a great description of what is likely to be observed during that condition. A clear indication that min and max values or range might not be an appropriate description of variability in a dataset occurs when
the measure of central tendency is close to the minimum (or maximum) value but far from the other values.

Using the IQR helps solve the issue of the range being sensitive to outliers. By focusing on the data only between the 75% and 25% values, any extreme values likely to be outliers are not included in the description of variability. Continuing the example from the preceding paragraph, the description would be, "Rates of responding during this condition averaged 12.06 (IQR = 8.68–12.63)." IQR, however, does not solve for the second common instance where range is a misleading description of our data: trending time series data.

Let's look at why range can be misleading when we have trending time series data. Consider the dataset with trending time series data in the left column of Fig. 4.1. Time series data are datasets wherein one of the important dimensions for analysis is time. Behavior analysts commonly use time series data, such as when sessions or calendar date make up the x-axis, both of which are ordered left-to-right by time. Trending time series data makes using these indices of variability problematic because they fail to communicate to the reader what they can expect. For example, the min and max value descriptions of variability in the upper left panel of Fig. 4.1 would be "min = 0.00, max = 20.11." But, assuming the reader is curious about what they can expect in their next session, the max might be useful here, but the min certainly is not. The same holds for using range and IQR with these trending time series data. In each instance, the description of variability is misleading in terms of helping the audience know what they might expect to happen next.5 Using only the topics we have covered thus far in the book, there is no ready solution to this problem. However, later in the book, we take a much deeper dive on time series data and how to handle these situations. For now, the key takeaway is that your spidey senses should be induced if you see someone reporting min and max values, range, or IQRs with trending time series data, or if they do not provide evidence that their data are not trending.

5. At a more generalizable level, this discussion highlights the relevance of local vs. global descriptions of data and when each might be useful. Local descriptions of data are descriptions wherein we use only a subset of our total dataset to derive a point estimate (central tendency or variability) based on an important independent variable, such as using only data from the most recent five sessions. Global descriptions of data are descriptions wherein we use all data relevant to the point estimate we are deriving (e.g., all data from baseline vs. intervention conditions). Later in the book, we'll discuss the conditions under which you might choose local vs. global metrics for your data once we have a few more topics under our belt. But, for now, the main point is that using these indices of variability as a description of trending time series data is often misleading and should be avoided.

Lastly, let's chat about the standard deviation to close out this benefits and drawbacks section. Standard deviation is particularly useful when you want to communicate something to the reader about what they can expect in terms of session-to-session variability. The max and min values may never be observed again, the range is just a different way to describe the max and min that also may never be observed again, and the IQR focuses on only two data points in your dataset (i.e., the 25th and 75th percentiles). But, on average, from observation to observation, a reader might want to know what kind of variability they are most likely to contact. The standard deviation gives the reader that information directly.

The primary drawback to using the standard deviation goes back to the assumption needed for the standard deviation to be an accurate measure of variability: a symmetric distribution. When we calculate standard deviation, we are essentially creating a new dataset. The new dataset we are creating is made up of each new value that we get by squaring the difference between the observed value and the arithmetic mean of the original dataset. If this new dataset is not symmetrically distributed, then taking its arithmetic mean would be misleading (see Chapter 3 if you can't quite remember why). The giveaway that this might be inappropriate would be if your original dataset is skewed in some way or has a lot of outliers. And, just as with point estimates of central tendency, you can describe your variability using the median difference from your measure of central tendency, trimmed mean difference from your measure of central tendency, or perhaps the mode difference from your measure of central tendency.
Describing how well you know your measure of central tendency

In the previous section, we talked about several ways to describe the variability of your data. But, readers may not be interested in all the nuance of every observation you made. Rather, they might want to know whether the intervention was effective (yes/no), how much the targeted behavior actually changed (average or overall improvement from baseline), or what they might expect to see during any given
observation moving forward (on average or during most observations). That is, in many contexts, the Individualized Education Program (IEP) team, parents, or a new paraprofessional wants to know the measure of central tendency and how confident you are in your prediction that the current pattern will continue. In this section, we look at several ways to talk about that uncertainty in your descriptions and predictions about your chosen measure of central tendency.
Standard error

A subtle assumption underlies this entire chapter. That assumption is that there is likely to be natural variation or variability in the data we collect (i.e., our data are not perfectly identical across all observations). Some days the client might obtain 94% correct, other days 93% correct, and still on other days 100% correct. Despite our best efforts, we are unable to control everything in the universe and so, because behavior is complex, things outside of our control occur and influence the behavior we are measuring. When we try to talk about what steady-state responding (i.e., measures of central tendency) looks like during ongoing data collection, variability in our data poses a bit of a problem because every new observation that we make may change how we can talk about the measure of central tendency (e.g., arithmetic mean vs. median) and the value that measure takes. Sometimes this natural variation in behavior will only slightly impact what measure we use and the value it takes. And, sometimes the natural variation will have a large effect.

Fig. 4.2 shows a visualization of this challenge. We start with some set of data, we calculate the mean responses per minute, we collect new data, we update our calculated mean, and round and round we go. As you likely noticed, with each new batch of data, the mean likely changes. And the amount it changes depends on the overall variability of responding observed with the newly collected data, how much that data differs from previous data, and how much data we have overall. For behavior analysts who communicate regularly with their clients, their clients' caregivers, collaborators on the IEP team, or other stakeholders, such constant fluctuations in the mean might be hard to interpret. How confident are we in our calculated mean responses per minute as being indicative of the effect of an intervention on the individual's current behavioral repertoire?
Figure 4.2 Visualization for what standard error is trying to solve for.
Standard error is one way to provide information about how confident we can be in the description of our data provided by the measure of central tendency. The most common standard error is likely the standard error of the mean (SEM). As the name implies, this is used when we want to communicate about how confident we can be in using the arithmetic mean to describe our data. As a question: how close do we think our mean is to measuring the "true mean" if it were known? Though the "true mean" is likely unknown, an intuitive observation is that the more data we collect on responding within a condition, the more likely we are to know exactly what effect that environmental condition has on responding. Fig. 4.2 provides a visual example. We are much more likely to feel confident in making claims about steady-state responses per minute as we go from three data points to nine data points. Adding observations increases our confidence in our numerical description of how the contingencies in effect influence responding. Quantitatively, the changes in the calculated means demonstrate this increased confidence in our descriptions of environment-behavior relations, too. In the top row,
the mean changes from 3.00 to 4.83 as we go from three data points to six data points, a difference of 1.83. As we go from six to nine data points, the mean changes from 4.83 to 4.78, a difference of 0.05. Succinctly, as we add more observations, our confidence in our measure of central tendency increases.

We can turn that textual description into an equation. In words, our confidence in our measure of central tendency (e.g., our claim about responses per minute) depends on how variable the data are and the number of observations we have made. We already have an equation that talks about the variability in our data via the standard deviation ($\sigma$) in Eq. (4.5). All that's missing is to include how our descriptions of central tendency get better with more observations. As the number of observations goes up, we would want the standard error to go down. An easy way to do this would be to simply take our equation of the variability in our dataset and divide it by (the square root of) the number of observations. As an equation:

$\text{Standard error of the mean} = \text{SEM} = \dfrac{\sigma}{\sqrt{N}}$ (4.6)
There are at least two ways to write or talk about this description of SEM. One is to simply list the SEM following the statement about our measure of central tendency. For example, we might write, "Ada emits 8.62 responses per minute (SEM = 1.32)" or "Ada emits 8.62 responses per minute ± 1.32 (SEM)." As a reader, this lets you know that the "true average of responses per minute" might be between 7.30 and 9.94. (Note, this tells the reader nothing about the variability in responding they might observe.) However, because people have to do the math on their own in this situation, a second way we can write this to help the reader might be to add that SEM range into the description for them: "Ada is likely to emit average responses per minute between 7.30 and 9.94."
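Eq. (4.6) is a one-liner in practice. A sketch with hypothetical responses-per-minute data (using the sample standard deviation, per footnote 3):

```python
import numpy as np

responses_per_minute = np.array([7.9, 8.4, 9.1, 8.0, 9.3, 8.7])  # hypothetical

mean = responses_per_minute.mean()
sem = responses_per_minute.std(ddof=1) / np.sqrt(len(responses_per_minute))

# Report in either of the two styles described above.
print(f"{mean:.2f} responses per minute (SEM = {sem:.2f})")
print(f"Likely average between {mean - sem:.2f} and {mean + sem:.2f}")
```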
the curious reader and Google; but they all attempt to answer the same questions: How well do we actually know our measure of central tendency? How confident can we be in this description of our data? And, they seek to answer those questions precisely and numerically.
Confidence intervals

Behavior analysts functioning as practitioners know that their clients emit all sorts of responses throughout the day in their absence. It is possible that many of these responses are functionally equivalent to the responses observed during applied behavior analysis sessions. As a result, any single observation that we make (and the data collected) comprises only a sample of all environment-behavior relations we could possibly observe. So how accurate is our measurement compared to what we would get if we measured all instances of behavior throughout the entire day? Confidence intervals are designed to try to answer this question for all functionally similar situations.

Confidence intervals attempt to answer the above question by building on our descriptions of how well we know the mean via the SEM. At its core, standard error balances the variability we see in one measure of our data compared to how many observations we have actually made. Confidence intervals are essentially asking: What if we get a bunch of those measures that balance variability and the number of observations but from different time samples? Thinking about confidence intervals this way is useful because it allows us to build on the equation for SEM from before. And, what we add is determined by the new question that we are trying to answer.

To add on to the standard error equation, we have to introduce the idea of the central limit theorem. You likely recall from the previous chapter that the best choice of measure of central tendency depends on characteristics of your data. For some data, the arithmetic mean is a great choice, for others the median is best, and for others, mode might be most appropriate. The central limit theorem is a neat mathematical proof6 that turns this information a bit on its head. It shows that, regardless of what the raw data look like, if we calculate the mean of our data many times over from repeated samples, and we graph how many times we get different means (i.e., the distribution of those means), the distribution of our calculated means will approach a normal distribution (i.e., Gaussian distribution or the "bell curve"; Motulsky, 2013).

6. If this term is new to you, a mathematical proof involves two things. The first is a "proof." Proofs are a set of arguments that begin with one statement and show how you get to another statement using only the rules of logic. The second bit is "mathematical." All that means here is that we're deriving a proof using the symbols and logic of mathematics. For more about this stuff, we recommend Cupillari (2005) and Clapham and Nicholson (2014).

Figure 4.3 Visualization for how standard deviations and confidence intervals are calculated.

Fig. 4.3 shows why it is useful that we get a normal distribution when calculating many means from different samples of our response measurements. Because of the properties of the normal distribution, we know that approximately 68% of our measurements will fall within one standard deviation of the mean. That could be either one standard deviation less than the mean or one standard deviation greater than the mean. Similarly, we know that 95% of our observations will fall within two standard deviations of the mean, and that 99.7% of our observations will fall within three standard deviations of the mean. By now, you've probably already guessed where we are headed. Confidence intervals are useful when we want to predict what the possible values of the mean will be if we could collect time samples spanning all functionally similar time samples. As Fig. 4.3 shows, we could predict that 95% of the time we calculate a mean rate of response for a client, it is likely to fall within two standard deviations of the mean of all our measures of central tendency (and the other 5% of the time it will not). Another way to describe how many standard deviations something is from the mean of all measures is through what's called a z-score. As shown in Fig. 4.3, this is almost the same thing as the
number of standard deviations from the mean. So, depending on how confident you wanted to be with your predictions about your measures of behavior, you would simply choose the z-score to match!

We can now tie all of this together into the equation for calculating confidence intervals. Recall that our confidence interval calculation is building off our standard error equation:

$\text{SEM} = \dfrac{\sigma}{\sqrt{N}}$

We know that our measure of behavior is likely to differ from the true variability we would get if we were able to observe all instances of functionally similar behavior-environment relations. So, the next step is to account for this variability by using the z-scores we just talked about to increase the estimated variability in behavior that we are likely to observe:

$\text{Sampling error} = z \times \dfrac{\sigma}{\sqrt{N}}$ (4.7)

And, because we often have one general estimate for the average pattern of responding we are interested in (e.g., a single measured mean response rate; i.e., $\bar{x}$), we create our confidence interval around this measure by adding and subtracting the sampling error from that value:

$\text{CI for mean} = \bar{x} \pm z \times \dfrac{\sigma}{\sqrt{N}}$ (4.8)

Intuitively, this equation combined with Fig. 4.3 highlights how the more confident we want to be in predicting average future patterns of behavior, the wider that interval will get because z will increase.

As a final note, just as with standard error, you can calculate confidence intervals around just about any measure of central tendency that you want to use. For example, the confidence interval for medians is given by the following equation:

$\text{CI ranks for median} = (N \times 0.5) \pm z \times \sqrt{(N \times 0.5)(1 - 0.5)}$ (4.9)

Here, 0.5 is chosen because that is the quantile of the median observation in all datasets (our measure of central tendency we are
calculating the interval around), N is our number of observations as before, and—because of the central limit theorem—we get to use z values again to determine how confident we want to be (Conover, 1999). Once the positive and negative versions of Eq. (4.9) are calculated, you round those numbers to whole values, and then find those values in your ranked dataset. Your confidence interval for the median then spans from the low value to the high value (see Table 4.1 for an example). For readers interested in why this equation takes the shape that it does and for confidence interval equations for other measures of central tendency, we will leave that dialog to you and ChatGPT (OpenAI, 2022).
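As a sketch of Eqs. (4.8) and (4.9) with hypothetical data (z = 1.96 for a 95% interval):

```python
import numpy as np

data = np.array([6.2, 7.1, 5.8, 8.4, 7.7, 6.9, 8.1, 7.3, 6.5, 7.9])
z = 1.96  # z-score corresponding to a 95% confidence interval
n = len(data)

# Eq. (4.8): confidence interval around the arithmetic mean.
mean = data.mean()
half_width = z * data.std(ddof=1) / np.sqrt(n)
ci_mean = (mean - half_width, mean + half_width)

# Eq. (4.9): confidence interval around the median, via ranks.
spread = z * np.sqrt(n * 0.5 * (1 - 0.5))
low_rank = int(round(n * 0.5 - spread))    # round to whole ranks
high_rank = int(round(n * 0.5 + spread))
ranked = np.sort(data)
ci_median = (ranked[max(low_rank, 1) - 1],   # convert 1-based ranks
             ranked[min(high_rank, n) - 1])  # to 0-based indices
```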
Table 4.1 Demonstration of calculating the confidence interval for the median.

Other flavors for describing variability in your data

To this point in the chapter, we have focused primarily on the measures of variability for data that have two characteristics. First, the
data are continuous (i.e., they can take any value greater than zero and do not have to be whole numbers). Second, we feel okay assuming that responding is stable and that the behavior we collected data on is representative of that person's behavior when we are not around. But, sometimes our data are ordinal (e.g., ranking of preferences) or categorical (e.g., baseline vs. intervention; favorite color of students in a classroom). And, sometimes the data we collect suggest the behavior is trending or that responding during certain sessions or observation windows deviates substantially from what we typically observe. In this final section, we discuss measures of variability that can be used with categorical data, ordinal data, when our data are trending, or when our data are skewed in one direction or another.
Variation ratio

The variation ratio is one method for describing the variability of discrete nominal data. For example, perhaps we are interested in the variability around the number of verbal operants mastered per week (e.g., mands, tacts, intraverbals, echoics). Or perhaps we are interested in describing the variability in preference assessment rankings across a set of toys over time. The variation ratio is perhaps the simplest description, where we simply denote the proportion of cases that are not in the modal category. As an equation:

$\text{Variation ratio} = 1 - p(\text{Mode})$ (4.10)
For example, consider a situation where our client mastered 60 mands, 30 tacts, 5 intraverbals, and 5 echoics (Table 4.2). We can calculate the variation ratio for mands by plugging these data into the equation: 1 − (60/(60 + 30 + 5 + 5)) = 1 − (60/100) = 1 − 0.60 = 0.40. To use the variation ratio in a sentence, you might write, "The mode verbal operant learned during the last month was the mand (60 mastered; variation ratio = 0.40)." Note how, as a reader, without seeing any of the other data you know that approximately 40% of the mastered operants were something other than mands (and that 60% of the mastered operants were mands). And, as that variation ratio increases (e.g., 0.90, 90%) or decreases (e.g., 0.10, 10%), the reader gets a better sense of just when the modal category is dominant (e.g., variation ratio = 0.10) or when a fair amount of variability is present in the data (e.g., variation ratio = 0.90). Of note, variation ratios can range between 0.0 and 1.0.
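Eq. (4.10) with the verbal operant counts from the example above:

```python
operants = {"mands": 60, "tacts": 30, "intraverbals": 5, "echoics": 5}

total = sum(operants.values())
mode_count = max(operants.values())
variation_ratio = 1 - (mode_count / total)  # Eq. (4.10): 1 - 0.60 = 0.40
```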
Table 4.2 Demonstration of how aggregate percentage difference from mode (APDM) provides additional useful information about variability in categorical data.
As you may have noticed, though the variation ratio does provide quick information on the dominant category in our data, it does not provide us much information about the variability of the nondominant categories. For example, consider the two sets of data in Table 4.2.
92
Statistics for Applied Behavior Analysis Practitioners and Researchers
Similar to the example above, the data categories might be the number of verbal operants mastered, the sum of rankings for each item used within preference assessments conducted in the last month, or some other dataset where our data are discrete nominal or, perhaps, discrete ordinal. In both situations, the variation ratio gives you the same number; however, in the dataset with less variability, responding is primarily concentrated in two categories (A and B), whereas responding is evenly distributed across the three nonmodal categories (B, C, and D) in the dataset with more variability.
Aggregate percentage difference from mode

One way to capture this difference in variability is by calculating the mean percentage difference from the mode. Just as the name implies, we are again interested in understanding how much the categorical counts differ from the mode. To obtain the additional information, we calculate what is called a weighted percentage of those differences. Weighting a number simply means that you give data that comprise a larger proportion of your observations a greater impact on the final value compared to data that comprise a smaller portion of your data. For example, for the data in the column of Table 4.2 containing less variability, Category B comprises 35% of all observations, whereas Categories C and D comprise only 1% and 4% of the data, respectively. Thus the difference between Category B and Category A is given more weight when calculating variability compared to the differences between Categories C and A, and between Categories D and A.

Weighting your data is a rather straightforward procedure. We can use words to describe this process as, "Find the difference between your category of interest and the mode category, then multiply that difference by the percentage of your data that category comprises." For Category B in the left column of Table 4.2, that would be (60 − 35) × (35/100) = 25 × 0.35 = 8.75. As an equation using words, that would look like this:

$([\text{Mode count}] - [\text{target count}]) \times \dfrac{\text{target count}}{\text{total responses}}$ (4.11)

Now that we have calculated the weighted difference from the mode for one category, to get the aggregate percentage difference from mode (APDM), we would simply repeat the same process for the remainder of the categories and add them all up.
As a more formal equation, we can again use the math symbol $\Sigma$ to say "add up everything that follows" and use nomenclature that works regardless of the number of categories we have (from Category $i$ to Category $n$). This gives us the mathematical equation:

$\text{APDM} = \sum_{i=1}^{n} (\text{Mode} - i_{\text{count}}) \times \dfrac{i_{\text{count}}}{\text{Total count}}$ (4.12)

Looking at Table 4.2, we can see that APDM provides us with different measures of variability for the two different datasets that can be interpreted in a straightforward manner. The data on the left with less variability have a smaller APDM than the data on the right, which have greater variability.
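A small sketch of Eq. (4.12) using the category counts given for Table 4.2:

```python
def apdm(counts):
    """Aggregate percentage difference from mode, Eq. (4.12)."""
    total = sum(counts)
    mode_count = max(counts)
    return sum((mode_count - c) * (c / total) for c in counts)

less_variable = [60, 35, 1, 4]    # Table 4.2, column with less variability
more_variable = [60, 15, 15, 10]  # Table 4.2, column with more variability
print(apdm(less_variable))  # 11.58
print(apdm(more_variable))  # 18.5
```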
Consensus A final common measure of variability used for discrete ordinal data is called consensus. With consensus, the main idea is to figure out whether there was a complete lack of consensus (0.0) or a complete consensus (1.0). A complete lack of consensus (i.e., c 5 0.0; high variability) would mean that the number of responses at one end of an ordinal scale is the same as the number of responses at the other end of an ordinal scale. For example, perhaps a client has the same number of sessions getting 0% correct responding as they do the number of sessions getting 100% correct responding. In contrast, a complete consensus (i.e., c 5 1.0; no variability) would mean that all responses fell on one end of an ordinal scale such as in a situation in which a client emits 100% correct responding across all sessions. Although interpreting consensus is quite easy, the calculation is a bit complex. First, we obtain the proportion of all responses that fall into each value on the ordinal scale (pi). Second, we calculate the average value for all responses (μx ). For example, let’s assume the categories in Table 4.2 are actually an ordinal scale where A 5 1, B 5 2, C 5 3, and D 5 4. If we multiply the value counts in the column (60, 35, 1, and 4) by the ordinal value (1, 2, 3, 4) we get an average of 1.497 for the data with less variability and an average of 1.758 for data with more variability. Third, we calculate the difference between the maximum and minimum value in our dataset (dx). This would be 7 8
⁷ (60 × 1) + (35 × 2) + (1 × 3) + (4 × 4) = 60 + 70 + 3 + 16 = 149; 149/100 = 1.49.
⁸ (60 × 1) + (15 × 2) + (15 × 3) + (10 × 4) = 60 + 30 + 45 + 40 = 175; 175/100 = 1.75.
For the example we are working through here, dₓ = 4 − 1 = 3 because Category D had the highest value (4) and Category A had the lowest value (1). Lastly, for each category, we calculate the absolute value of the difference between that category value and the mean (|Xᵢ − μₓ|). All of these are then used to calculate consensus in the following equation:

$$\text{Consensus} = 1 + \sum_{i=1}^{n} p_i \log_2\!\left(1 - \frac{|X_i - \mu_x|}{d_x}\right) \tag{4.13}$$

For the data with less variability, we get a consensus score of 0.64 and for the data with high variability, we get a consensus score of 0.42. Remember from above that consensus = 1.0 means there is no variability whatsoever, and consensus = 0.0 means there is maximal variability. Interpreted back into the land of behavior analysis, higher consensus scores mean lower variability, which means more agreement (consensus) in responding from observation to observation.
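Because Eq. 4.13 is the most involved formula in this chapter, here is a minimal sketch of it in Python, assuming the ordinal recoding (A = 1 through D = 4) and the counts from the worked example above; the function name and layout are ours.

```python
import math

# A minimal sketch of Eq. 4.13 (consensus) for discrete ordinal data.
def consensus(values, counts):
    total = sum(counts)
    p = [c / total for c in counts]                           # p_i
    mu = sum(v * c for v, c in zip(values, counts)) / total   # mean ordinal value
    d = max(values) - min(values)                             # d_x, range of scale
    return 1 + sum(p_i * math.log2(1 - abs(v - mu) / d)
                   for p_i, v in zip(p, values))

values = [1, 2, 3, 4]  # Categories A-D recoded as an ordinal scale
print(round(consensus(values, [60, 35, 1, 4]), 2))    # 0.64, less variability
print(round(consensus(values, [60, 15, 15, 10]), 2))  # 0.42, more variability
```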
Choosing and using measures of variance in applied behavior analysis

Table 4.3 provides a review of the different measures of variability discussed throughout this chapter, the situations in which each is likely the ideal quantitative description of your data, and the situations in which it is likely best avoided. Because these have been discussed extensively throughout the chapter, we will not go into much additional detail here. However, to choose among the many options for describing the variability in your data, three overarching themes have been present throughout the chapter that are worth repeating here. The first theme is audience control. At the end of the day, statistics are just verbal behavior that plays a functional role in communicating something about your data to the likely listener or reader. The description of data variability you choose should be determined by what information about your data is most useful to your audience. Is it the worst and best case scenario of responding (e.g., min and max values, range), the range of data the person is most likely to contact during a typical session (e.g., IQR, standard deviation), how well you have measured your dependent variable (e.g., standard error, confidence interval), or something about categorical or ordinal data (e.g., variation ratio, APDM, consensus)?
Table 4.3 Outlining of the ideal circumstances under which you might choose each measure of variability to describe your data quantitatively and precisely.

Description of variability | Ideal use case | Situations to avoid using
Min and max | Reader wants to know the lowest and highest amount of behavior they might observe. The data are continuous or ordinal. | Max and min are outliers.
Range | Reader wants to know the maximum variability they might observe. The data are continuous or ordinal. | Max and min are outliers.
Interquartile range | Reader wants to know the most common range of variability they might observe. The data are continuous or ordinal with six or more ordinal values. | The best (or worst) case scenarios are important to communicate. Responding is trending or skewed.
Standard deviation | Reader wants to know, on average, how different from the average patterns of responding they might observe. | Responding is trending or skewed.
Standard error | You want to communicate with the reader how well you have measured your aggregate measure of central tendency. | You are interested in communicating to the reader how variable the raw data are.
Confidence intervals | You want to communicate the range that your aggregate measure is likely to fall within if you had hundreds or thousands of observation windows. | You are interested in communicating to the reader how variable the raw data are.
Variation ratio | Categorical or ordinal data where you are primarily interested in one particular response. | You are interested in talking about two or more categories of responding, or your data are continuous.
Aggregate percent difference from mode | Categorical or ordinal data where you are interested in variability observed among all categories. Continuous data that are trending or skewed (NB: bins of data will be needed here). | Your data are continuous and at steady state.
Consensus | Ordinal data where all ordered values are important (i.e., you are not just interested in one of the response categories). | Your data are continuous and at steady state.
The second theme is that the type of data you have and the shape that it takes should determine which description of variability you use. Many of the descriptions of data variability in this chapter assume that responding is continuous, stable, and therefore likely to be normally distributed around the measure of central tendency you are using. We got into the nitty-gritty details on data distributions and data types in Chapter 2 if you need a refresher. And, later in the book, we also review how to handle situations where these assumptions might not be met. At this point in the book, however, it is important to note that the use of these descriptions of data variability comes with certain underlying assumptions about your data. If those assumptions are not
met, then you would want to identify a different measure of the variability in your data. The final theme is that the descriptions of variability we reviewed represent only a small set of the total number of ways to describe and talk about the variability present in your data. We chose to present these because they are, arguably, the most common descriptions of data variability that you will encounter when reading research from other fields or that readers of your own work might be interested in seeing in your communications. Nevertheless, your decision as to how to describe the variability in your data should be determined by the first two themes noted previously: what description of variability is most useful to the likely reader, and which description does the type of data you have best support in meeting that function?
Supplemental: Why square the difference, and square root the final measure?

Recall, standard deviation is trying to answer the question, "How far from my measure of central tendency are my data?" or "Just how steady is this steady state responding?" Hopefully, our measure of central tendency is very close to splitting our data down the middle. That means that subtracting our measure of central tendency from each data point will cause about half of the differences to be positive and half of them to be negative (Fig. A.1). The wrinkle comes when we try to add up these differences because, if we have done a good job of cutting our data in half with the measure of central tendency, then the sum of the
Figure A.1 Visualization for why standard deviation (and any other equations calculating difference scores) involves some kind of data transformation to make it sensical. The figure walks through the following table:

Statistic | Data with More Variability | Data with Less Variability
Raw data | 0, 1, 2, 3, 3, 4, 5, 5, 6, 7 | 3, 3, 3, 3, 4, 4, 4, 4, 4, 4
Arithmetic mean | 3.6 | 3.6
Difference from mean, (xᵢ − μ) | −3.6, −2.6, −1.6, −0.6, −0.6, 0.4, 1.4, 1.4, 2.4, 3.4 | −0.6, −0.6, −0.6, −0.6, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4
Sum of differences from mean, Σ(xᵢ − μ) | 0 | 0
Sum of squared differences, Σ(xᵢ − μ)² | 44.40 | 2.40
Average squared difference from the mean, Σ(xᵢ − μ)²/N | 4.44 | 0.24
Standard deviation | 2.11 | 0.49
differences will be at or very close to zero, regardless of how spread apart or close together our data is (fourth row Fig. A.1). This is a problem because the whole point of this description of our data is to communicate whether the data we are looking at has a lot of variability or little variability. So this equation does not really meet the function for what we are after. Because of this problem, researchers developed a clever way to describe those differences between datasets. Instead of adding the raw
differences with the positives and negatives, let’s square all of the differences and then add them up (fifth row Fig. A.1). Squaring all numbers (positive or negative) makes them positive, no matter what. Now we can add up all of these squared differences, divide that number by how much data we have, and voila—we now have a description of the variability in our dataset that better captures how close or spread apart our data are (sixth row Fig. A.1). The only final catch is that the average squared difference from the measure of central tendency is still in squared units. But, we often want to talk about our data as they are, not in squares of what they are. For example, we want to talk about responses per minute, not responses squared per minute. So the final step is to simply undo that transformation by taking the square root of our data (bottom row Fig. A.1).
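For readers who want to see the whole transformation end to end, here is a quick sketch that reproduces the numbers in Fig. A.1; note that, like the figure, it divides by N (the population formula) rather than N − 1.

```python
import math

# A sketch reproducing the Fig. A.1 walkthrough: raw differences cancel
# to zero, so we square them, average, and take the square root.
more_variable = [0, 1, 2, 3, 3, 4, 5, 5, 6, 7]
less_variable = [3, 3, 3, 3, 4, 4, 4, 4, 4, 4]

def sd_walkthrough(data):
    n = len(data)
    mean = sum(data) / n
    diffs = [x - mean for x in data]
    print("sum of raw differences:", round(sum(diffs), 10))      # ~0 for both
    squared = sum(d ** 2 for d in diffs)
    print("sum of squared differences:", round(squared, 2))      # 44.40 vs 2.40
    variance = squared / n                                       # divide by N
    print("average squared difference:", round(variance, 2))     # 4.44 vs 0.24
    print("standard deviation:", round(math.sqrt(variance), 2))  # 2.11 vs 0.49

sd_walkthrough(more_variable)
sd_walkthrough(less_variable)
```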
CHAPTER 5

Just how good is my intervention? Statistical significance, effect sizes, and social significance

We had a joke about statistics, but it wasn't significant.
Introduction

Are you still with us? We hope all that talk about variability in Chapter 4 didn't increase your reading variability toward another book. (Although, if it did we wouldn't blame you as variety is the spice of life.) But, turning to matters more directly at hand, this chapter represents the last of what we consider the base of statistical literacy and application for behavior analytic practice and research. We've built this foundation brick-by-brick with those bricks including demystifying what statistics are and making the case for the relevance of statistics for behavior analysts (Chapter 1); data types and distributions common to behavior analysis (Chapter 2); and using point estimates and central tendency to describe our data (Chapter 3) as well as describing variability present in our data (Chapter 4). Chapters 2–4 fall under the umbrella term of descriptive statistics. That is, methods to describe our data succinctly using numbers. But, often behavior analysts are interested in more than simply describing their data. Rather, they want to know whether the data from one condition (e.g., baseline) or from one context (e.g., home) differ from the data collected in a second condition (e.g., intervention) or a second context (e.g., school). In statistical jargon, we want to infer whether our intervention caused a change in behavior, as well as the magnitude and social significance of the change. This is the focus of this chapter: approaches to quantifying the effect of an intervention on behavior change.¹ Before we can get there, however, give us a minute to change
from our surveyor outfits back into our bricklayer outfits so we can lay the next brick. To help orient you, consider the following hypothetical study.² A behavior analyst working in a medical center is interested in preventing pressure sores³ from developing for adults who use wheelchairs. To address this, they developed an intervention to promote wheelchair push-ups that consisted of attaching a device to the underside of the wheelchair that produced a brief audible alarm if a push-up was not completed every 30 minutes (if a push-up was completed within each 30-minute timeframe, the alarm did not sound). To evaluate this intervention, the researcher randomly assigns participants to experimental and control groups with the experimental group contacting the alarm device intervention and the control group contacting education-as-usual. After some time, the researcher collects data on the percentage of opportunities that a wheelchair push-up was completed and has the participants complete an evaluation about the degree to which they found the intervention and outcome acceptable.⁴ In doing so, the researcher now has multiple pieces of information through which to interpret the alarm device intervention—these being statistical significance and effect size from the push-up data, and social significance from the intervention evaluation data.
¹ Unintentionally, a second goal of this chapter became to see if we could set the record for the most footnotes contained within a single behavior analytic book chapter. We're not sure what the record is. But we suspect we're in the top five all-time. And, with the exception of this footnote, we sincerely tried to avoid being superfluous with these.
² Aspects of this example were inspired by or derived directly from White et al. (1989). Additional inspiration was drawn from an example described by Kirk (1996).
³ Details about pressure sores—their development, treatment, and prevention—are a bit tangential for this illustrative example; however, for some general information, pressure sores develop following prolonged contact between a part of the body and a surface, can be associated with undesirable outcomes (e.g., infection), and can be costly to treat. Pressure sores are a concern for wheelchair users given the amount of time the buttock is in contact with surfaces (e.g., bed, wheelchair seat). Primary prevention involves regular pressure relief, which may take the form of repositioning the body (if in bed) or doing a "wheelchair push-up" (positioning the hands on the wheelchair handles to provide leverage to elevate the body upward, thereby eliminating or relieving contact between the buttock and the seat).
⁴ We know, as a behavior analytic intervention, we also would be tracking changes in behavior over time and would use visual analysis of time series data. But, time adds many wrinkles we are not quite ready to handle yet. We get to those in Chapter 8 once a few more important topics are fleshed out.

Though described simply, the number of possible analytic outcomes that stem from these few pieces of information is actually quite large! Consider if the researcher obtained a p value categorized as statistically significant, an effect size categorized as large, and social significance
data categorized as acceptable. Our interpretation here is quite easy and intuitive. We can feel confident ruling out sampling variability as the explanation for the difference between the groups, confident that a practically relevant effect is present, and confident that other people are likely to find the intervention favorable. With this "fact in the bag," researchers should feel comfortable extending this work to further improve upon or to evaluate components of the intervention. And, for practitioners, it would be a no-brainer to add this intervention to their practice toolbox. But, results aren't always this clean. Now consider how the interpretation gets complicated if the researcher obtained a p value of 0.06, which is categorized as statistically not significant, an effect size categorized as moderate, and social significance data categorized as acceptable. Or consider a situation where the result is statistically significant, the effect size is categorized as moderate, and social significance as unacceptable. It might be tempting to dismiss interventions that fail to meet the highest of significance levels across all three categories. However, the presence of at least one significant outcome (statistical, effect size, or social) suggests that there might be something worth pursuing here. What then? For the researcher, it might be an easier decision. Just claim agnosticism and that "more research is needed." But for the practitioner who needs to make a decision today, the decision is not so easy, and other factors likely need to be considered (e.g., evidence for alternative interventions, client characteristics that might inform effectiveness). We hope this example paints two pictures that form the purpose of this chapter. First, decisions around intervention effects are sometimes straightforward and other times less so. We often have to say "yes" or "no" when asked whether an intervention was effective for someone. But, life is rarely binary.⁵ Second, claims around intervention effects can be measured in many different ways, with each method telling you something unique. All are statistical tools and, as tools, have contexts in which they are useful and contexts in which something else is likely better. With these contexts in mind, let's focus our lens on each of these types of outcome measures and what they can and can't tell us.
⁵ Don't tell the computers this, though. We don't want to instigate an uprising in this age of AI.
Statistical significance

We know. Trust us, we know! We know that you likely read the "Statistical Significance" subheading and moaned, groaned, and rolled your eyes. For many of us, a substantial proportion of our statistical (mis?)education focused on null hypothesis significance testing (NHST), the p value, and attempts to interpret that p value. In more cases than not, the likely outcome was pure confusion and chaos.⁶ We understand—an accurate understanding of statistical significance isn't intuitive and likely involves some light brainwashing. Fortunately for you (and for us), our focus here isn't on calculating p values (there's a plethora of sources available for such purposes). Rather, our focus is on ensuring that behavior analysts understand exactly what a p value tells you, why p values matter, and some of the limitations or misunderstandings associated with NHST and p values.⁷ If we asked you to tell us some things that come to your mind when we say "statistical significance," we'd wager you might say things such as "p value," "p less than (0.001, 0.01, or 0.05)," "null hypothesis," "statistically significant or statistically not significant," and a few curse words sprinkled here and there. All those responses are relevant and expected. To make full sense of those responses (sans curse words, no explanation necessary for those) and as a means to provide a common sense discussion on NHST and p values, we'll use an example⁸ that we hope resonates.
⁶ Surely this is nothing to fret over; even some of the best and brightest have seemingly mistaken what a p value represents. For example, in his classic text, Sidman (1960) seemed to have conflated p values with the likelihood of replication, noting that "A given experimental operation may, in reality, have no significant effect. But a series of replications is likely to yield a few estimates of statistically significant differences between experimental and control observations. . . Similarly, even if the experimental variable does have a real effect, a series of replications is still likely to yield a few instances of statistical nonsignificance" (p. 45). As we'll review soon, p values are not related to the probability that an observed outcome is replicable. Regardless of Sidman's potential confusion, he squarely hit the nail on the head when he rightly advocated that conclusions regarding replication (and generality) are derived from—you guessed it—replications themselves (direct and systematic).
⁷ You might be surprised to learn that folks—including statisticians—have written about such issues for over 80 years (e.g., Berkson, 1938). This means the issues we're going to tell you about aren't original and certainly aren't of just recent interest. But, after all, this is a statistics book and it'd be a bit weird if we didn't include this for completeness.
⁸ If additional examples would be beneficial, we direct the interested reader to Chapter 15 in Motulsky (2018).
Example: A researcher has developed a manualized intervention to increase the IQ of individuals with autism spectrum disorder (ASD).⁹ Based on previous work, the researcher guesses (i.e., hypothesizes) that this intervention will increase IQ by at least 15 points compared to not experiencing the intervention. To determine whether the intervention actually has this effect requires logic¹⁰ that ties your documented control of conditions to the data on behavior from those conditions. In simple terms, the logic runs as follows: "Given how well I controlled my conditions and the data I am seeing across conditions, the intervention is the most probable reason why behavior changed"; or, "Despite everything I tried to control in my conditions, the data I am seeing across conditions suggests I am missing something important that influences my target behavior." Note the binaryness of this claim and the challenges we highlighted previously to thinking in this way. In any case, it is often the former claim that researchers or practitioners try to prove. In any science, it is tough sledding to prove anything. This is because proving something requires sufficiently excluding a host of other reasons why something may have happened. Such explanatory capabilities are likely out of humans' reach until we can measure everything, everywhere, and all at once. As such, an easier pivot is to turn the logic toward disproving a hypothesis (in statistical jargon, the null hypothesis), which is typically an inverse of the hypothesis we really care about (in statistical jargon, the alternative hypothesis). Returning to our example, the null hypothesis would be, "Experiencing the new intervention will not increase IQ by at least 15 points compared to not experiencing the intervention."¹¹
⁹ The astute reader likely considered the similarity to Lovaas (1987), which served as the source of inspiration for this example.
¹⁰ Which is just verbal behavior we have learned within our lifetimes. Fun fact: most of us are used to the logical system (i.e., cohesive set of verbal behavior) that extends back to ancient Greece. But, as with all human behavior, this Western logical system has inconsistencies, and alternative systems of logic exist (e.g., Avicenna & Forget, 1892). Let us know if you choose to ride this wave. And, reach out so we can chat a bit at the next conference about how you like the variations in logical waters.
¹¹ Often, but not always, the null hypothesis is also a nil hypothesis wherein there is no difference (aka, zero difference) between groups. Although this is not the case for the current example, the astute reader will likely recognize that any nil hypothesis is likely to be false even before collecting any data. The reason is that it is almost a certainty that some difference, even if minuscule, will appear in the calculation. See below for further details.
Once our null hypothesis has been stated, the next task is to select what is known as a significance level (represented by the Greek letter alpha, α). The significance level (α) is a value that signifies with what probability the researcher is willing to make an error about claims related to the null hypothesis. Historical providence gives us the likely familiar α = 0.05,¹² meaning that the researcher wants to be correct 95% of the time and finds it acceptable to be incorrect 5% of the time. Said differently, α = 0.05 means that, if the null hypothesis is actually true (i.e., the intervention has no effect on behavior), the probability of incorrectly rejecting it is 0.05 or 5%. It's typically best practice to determine your analytic strategy before you start an experiment or conduct an intervention. Assuming the behavior analysts in our example followed best practice, the next step would be for them to conduct their study. To do so, they implement the new intervention with one group of participants (the experimental group) and do nothing differently than they already were for the other group of participants (the control group). After the intervention is completed, the researcher measures IQ for all the participants, conducts the appropriate statistical analyses based on the type and distributions of the data (Chapters 2–4), and obtains a p value = 0.03.¹³ Nice! That's less than our 0.05 cutoff so we must be geniuses and our intervention the greatest thing since sliced bread.¹⁴ Or is it? Given the weird logical turns we made above, what exactly does that p value mean? Here is where chaos and confusion reign supreme. Humans the world over have been disoriented by the logical gap between what we want to know and the logical pivots we had to make based on what's realistically possible to measure and to logically claim.

¹² Of course the selected α is an arbitrary threshold—any α value can be selected. If that's the case, you might be thinking, well, I don't want to be wrong, so why don't we specify a very, very, very small α? It's not as simple as that, and doing so might still lead to incorrect assumptions being tendered. The reason is that there's an important trade-off related to alpha and Type I and II errors that must be considered. Specifically, as alpha decreases in value, there's a decrease in Type I errors (aka, false positives) and an increase in Type II errors (aka, false negatives). On the flip side, as alpha increases in value, there's an increase in Type I errors and a decrease in Type II errors. This give and take is an important consideration when specifying your significance level.
¹³ Historically, researchers have reported obtained p values as either a continuous quantity (e.g., p = 0.03) or as an inequality (e.g., p < 0.05). Current best practice is to be precise and to do the former (Wasserstein et al., 2019).
¹⁴ For an interesting claim around the origin of this saying, see Molella (2012).

Before we break this down, we encourage you to take a quick
break, go on a brisk walk, make a cup of coffee or tea (we recommend white or oolong), and pick this book back up when you're ready to perform some mental jiu-jitsu to overcome any previous miseducation, misunderstanding, misconceptions, misperceptions, misinterpretations, misreading, or misconstruction related to statistical significance. [Pause for tea making or a brisk walk. Might be best to stretch and warm up the muscles, too, so you don't pull anything.] Okay, so what does a p value = 0.03 mean? Generally speaking, a p value tells us the likelihood of obtaining a difference as large or larger than the one we observed with our data if the null hypothesis is true. Restated, a p value = 0.03 means that if the null hypothesis is true there is a 3% chance we would obtain a difference at least as large as the one observed. Alternatively, it also means that if the null hypothesis is true there is a 97% chance of obtaining a difference smaller than the one observed.¹⁵ So the researcher has calculated their p value, understands what it means, and finds that the obtained p value (p = 0.03) is less than the threshold they chose before the experiment was conducted (α = 0.05). And, in case you missed the hints to this point in this paragraph, this is all about whether the null hypothesis is true or not. It says nothing about whether the alternative hypothesis is true or not, which is often what we care about most. So, here, the researcher can reject only the null hypothesis (p = 0.03 < α = 0.05) and, historically, such an outcome would be labeled as statistically significant.¹⁶,¹⁷ To help firmly establish the concept of the p value, you also need to understand what a p value doesn't tell us. What do p values not tell us? Well, many things. They can't tell you what the temperature will be tomorrow nor which outfit to wear. And, they can't tell you the best birding spot in Ecuador nor when flights are cheapest to Australia. All kidding aside, they also say nothing about the practical importance of the outcomes, nor their reliability or generality. They say nothing
¹⁵ Note well that neither p value explanation says anything in any way about scientific or practical importance.
¹⁶ We qualified the label of this outcome as "historically" because current best practice as offered by Wasserstein et al. (2019) is ". . .'statistically significant'—don't say it and don't use it" (p. 2).
¹⁷ Consider how things might be interpreted differently if a different significance level was specified. For example, if the researcher set a more stringent value of α = 0.01, this would have led to the researcher not being able to reject the null hypothesis because the obtained p value is more than the significance level specified (p = 0.03 > α = 0.01). Historically, such an outcome would be labeled as statistically not significant.
about the probability our intervention actually had an effect. And, they prove nothing. To reiterate this point, Motulsky (2018, pp. 139–140) outlined a list that is worth repeating here:
• The P value is not the probability that the result was due to sampling error.
• The P value is not the probability that the null hypothesis is true.
• The probability that the alternative hypothesis is true is not 1.0 minus the P value.
• The probability that results will hold up when the experiment is repeated is not 1.0 minus the P value.
• A high P value does not prove that the null hypothesis is true.
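To make this concrete, here is a small simulation of our own construction (not drawn from any of the works cited here): when the null hypothesis really is true, p values below 0.05 still turn up about 5% of the time—exactly the error rate α allows.

```python
import random
from scipy import stats

# Simulate many two-group "studies" where both groups come from the SAME
# population, so the null hypothesis is true by construction. A t-test
# will still flag ~5% of them as "significant" at alpha = 0.05.
random.seed(1)

n_studies = 10_000
false_positives = 0
for _ in range(n_studies):
    group_a = [random.gauss(100, 15) for _ in range(30)]  # e.g., IQ scores
    group_b = [random.gauss(100, 15) for _ in range(30)]  # same population
    _, p = stats.ttest_ind(group_a, group_b)
    false_positives += p < 0.05

print(false_positives / n_studies)  # ~0.05, matching alpha
```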
Common criticisms of null hypothesis significance testing

Following from the example detailed above, the related notes, and the list just provided, we are well poised to dive deeper into the common criticisms of NHST and the use of p values in science. To structure this summary, we categorize criticisms into three buckets as others before us have done (e.g., Ferguson, 2009; Kirk, 1996). First, NHST doesn't provide the information that the researcher or practitioner often wants, even if they refuse to believe it.¹⁸ What do researchers want to know? We want to know the likelihood that the alternative hypothesis is true—that our intervention changed behavior. But, having just read the list in the previous paragraph, you and we know that's not what is derived from NHST and the resulting statistics. Rather, as reviewed, NHST derives the likelihood of obtaining a difference as large or larger than the one observed if the null hypothesis is true. As noted by Kirk (1996), the difference is most easily represented using mathematical notation: p(H₀|D) ≠ p(D|H₀). In words, the probability that the null hypothesis (H₀) is true given the data we obtained (D) is not the same as the probability we would obtain these, or more extreme, data if the null hypothesis is true.¹⁹

¹⁸ Or as noted by Cohen (1994), ". . .it does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!" (p. 997).
¹⁹ See also Cohen (1994) for a more detailed description of why p(H₀|D) ≠ p(D|H₀). To illustrate, you can also swap in fun examples such as the probability it rained because there are clouds is not the same as the probability there are clouds because it rained. In many situations, you might have clouds without rain. But, it's very unlikely you'll have rain without clouds (though, sun showers, anyone?). Another classic example we don't particularly like but it makes the point salient is that the probability someone is dead because they were hung is not the same as the probability that someone was hung because they are dead. The former is often near certain; the latter is (statistically) very low as people die for all sorts of reasons.
The second bucket of criticisms around NHST involves the likelihood of there ever being absolutely no difference in measurements of behavior between two groups (group designs) or between two conditions (within-subject designs). To help clarify this issue, consider the oft-employed null hypothesis that there is no difference between the experimental and control data. In such situations, the null or nil hypothesis is almost always false (Cohen, 1990, 1994). The reason is that it's nearly certain that some amount of difference (even if minuscule) will be observed and close to impossible that perfectly identical data will be collected between experimental and control groups. From this, the second critique argues that deriving and interpreting a p value is a fool's errand. Or, as noted by Tukey (1991), "All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking, 'Are the effects different?' is foolish" (p. 100). Such logic leads to the reality of no likelihood of a Type I error and only the possibility of a Type II error (Schmidt, 1992).²⁰ The last critique of NHST, filling our third bucket, relates to the fixed and arbitrary level of significance that determines whether results are statistically significant. Creating a "significance level" against which to evaluate a p value sets up an arbitrary and unnecessary dichotomy. That is, our results are either statistically significant or statistically not significant. We admit that dichotomous, yes-or-no answers are appealing and require less interpretive effort. The reality, however, is that there is no natural basis or need to do so. That is, we may seek to "carve nature at its joints," but doing so is based on an assumption that nature has such joints. Or, as eloquently noted by Rosnow and Rosenthal (1989), "We want to underscore that, surely, God loves the 0.06 nearly as much as the 0.05" (p. 1277). Rather, it's seemingly more important to view significance (statistical or otherwise) as a continuum that's interpreted relative to the context within which we are evaluating an intervention. With all the fuss mentioned above, you might be tempted to conclude that it's okay to simply abandon statistical significance and forget that you read the previous pages. To do so, however, would be missing the point.

²⁰ A Type I error = a false positive wherein the null hypothesis is actually true, but the researcher decides to reject it. However, this can never be the case if the null hypothesis is always false. A Type II error = a false negative wherein the null hypothesis is false, but the researcher decides not to reject it. See Chapter 6 for a full discussion on error rates and other fun classification metrics.

We agree with the wisdom of folks much wiser than
us that the aim of the above critiques is not to dismiss significance testing altogether. Rather, the aim of understanding the critiques above should be twofold. First, so you can accurately understand, talk, and write about p values. Second, so you can expand and become more savvy in how you think statistically about the results of research²¹ (Wasserstein et al., 2019). With those two aims in mind, let's briefly review the common NHST approaches you are likely to read about or use in your daily research and practice.
Comparing two datasets

In arguably the simplest of scenarios, a behavior analyst would want to know whether two sets of behavioral data differ. For example, perhaps they have data from a baseline and intervention condition for a client, rates of skill acquisition for verbal behavior and motor imitation targets, or on participation from students in two different classrooms. In these scenarios, we want to know whether behavior differs as a function of reinforcement contingencies, behavioral topography, or setting, respectively. This gets us into the realm of "inferential statistics." It is "inferential" because we simply cannot measure all behavior for all individuals and all contingencies at all points in time, such that the data we have collected are necessarily a sample of a much larger population of responses. Without a complete picture, we have to infer something about the client(s)' total behavioral repertoires, using our limited sample to estimate the probability of observing the behavior we did across conditions—again, assuming behavior actually does not differ across conditions. Most inferential statistical tests can be classified under one of two categories, parametric or nonparametric²² (Fig. 5.1). In a nod to Chapters 2–4, a parametric statistical test just means we can compare two or more parameters (i.e., descriptions) of our datasets (e.g., arithmetic mean, standard deviation) because the distribution of our data provides accurate metrics to describe central tendency and variability. When, for whatever reason, the distribution of our data is. . .not quite
This shouldn’t be surprising to our readers who consumed Chapter 1—after all, expanding statistical thinking for behavior analysts was a driving factor for us in writing this book. 22 Throughout this section, we are going to avoid providing the equations and details around implementing. This is because many tutorials already exist online to walk you through this for your preferred analytic software (e.g., Excel, SPSS, R, Python). Instead of the details, we hope you walk away with a higher level understanding of the conditions under which each type of test is likely most appropriate.
Figure 5.1 Decision tree for choosing among common statistical tests.
right. . .for parametric statistical tests, we can use the second broad category of statistical tests, uncreatively called nonparametric statistical tests. They are called nonparametric as they do not use point estimates of central tendency or variability to compare the raw datasets. The most common parametric statistical test for comparing two groups that you will likely encounter is the t-test. A t-test is used to compare the arithmetic means of two sets of data when (1) the dependent variable (DV) is continuous and approximately normally distributed and (2) the variable distinguishing the two groups is categorical or ordinal. Variations here surround whether the two sets of data are linked in some way (e.g., from the same person, matched samples; paired t-test) or are unrelated (e.g., students in different districts or caseloads; two-sample t-test). When your data are not normally distributed,²³ the common nonparametric alternatives are the Wilcoxon signed rank test or the Mann-Whitney U test. The Wilcoxon signed rank test is appropriate when your data are from the same person or
²³ A common test to determine whether this assumption is violated is the Shapiro-Wilk normality test. Most statistical packages will kick out the results of this test if you run a t-test. If not, you can always Google how to do this with your preferred analytic software.
matched samples. The Mann-Whitney U test is appropriate when your data are from independent samples.
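As a hedged sketch of how these comparisons look in practice, here are the two-dataset tests run in Python with scipy; the session values and variable names are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical session data (responses per minute) for one client.
baseline = np.array([12, 15, 11, 14, 13, 16, 12, 15])
intervention = np.array([18, 21, 17, 22, 19, 20, 23, 18])

# Check the normality assumption first (Shapiro-Wilk; see footnote 23).
print(stats.shapiro(baseline).pvalue, stats.shapiro(intervention).pvalue)

# Parametric: two-sample t-test for unrelated groups.
print(stats.ttest_ind(baseline, intervention))

# Nonparametric alternatives when normality is doubtful:
print(stats.mannwhitneyu(baseline, intervention))  # independent samples
print(stats.wilcoxon(baseline, intervention))      # same person/matched pairs
```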
Comparing three or more datasets

Parametric tests comparing three or more datasets start to get a bit more complicated, with the most appropriate test requiring you to answer just a few more questions (Fig. 5.1). These questions are: (1) How many independent variables (IVs) do I have²⁴; (2) is my DV approximately normally distributed; (3) were the data collected from completely separate groups, from repeated observations with the same individuals, or both; and (4) are there any characteristics of my participants that are significantly correlated with my DV?

If your answer to the first question is "I have 1 IV," then this paragraph is for you. Examples here would be asking about differences in behavior based only on the intervention condition, behavior therapist who works with the client, or intervention setting. Your answer to the second question will, again, determine whether you use a parametric (normally distributed) or nonparametric test (not normally distributed). Your answer to the third question will determine whether you use a one-way ANOVA²⁵ (independent groups, parametric), repeated measures ANOVA (dependent groups, parametric), Kruskal-Wallis (independent groups, nonparametric), or the Friedman test (dependent groups, nonparametric). Your answer to the fourth question determines whether you use the ANOVA variants just described (no correlations between participant characteristics and your DV) or have to pivot to ANCOVA²⁶ (parametric test when some measure in your data correlates with your DV) or the nonparametric equivalents (e.g., Puri & Sen, McSweeney & Porter). Fig. 5.1 shows all this as a handy decision tree.

If your answer to the first question is "I have 2 IVs," then you are in the land of two-way ANOVAs, mixed-model ANOVAs, the corresponding ANCOVAs, and their nonparametric equivalents. Here, the critical questions are the same as in the previous paragraph to determine whether you use ANOVAs, ANCOVAs, or their nonparametric
²⁴ Note that this is not asking about the levels of your IV, just how many IVs you have. So, if I am comparing behavior across three different classrooms, I would have three datasets but only a single IV (i.e., classroom) with three categorical labels for that IV.
²⁵ ANOVA stands for ANalysis Of VAriance.
²⁶ ANCOVA stands for ANalysis of COVAriance.
equivalents. The only additional question in this paragraph is whether the two IVs you are using are independent or one is dependent. If the two IVs are independent (e.g., two categorical variable labels assigned to independent groups), then the two-way ANOVA or its nonparametric equivalent is your friend. If one IV is a between-group measure and one is a within-group measure (e.g., repeated measures of students in two different classrooms), then the mixed-model ANOVA or its nonparametric equivalent is your friend. As above, Google or Bing will also be your friend here in identifying how to implement these tests in your favorite analytic software. By now we hope you are simply familiar with the questions you need to answer in order to Plinko²⁷ your way down to the most appropriate statistical test based on the data you have and their underlying distribution.
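As a companion to the decision tree in Fig. 5.1, here is a sketch of the one-IV branch using scipy; the classroom numbers are made up for illustration.

```python
from scipy import stats

# Hypothetical rates of responding in three classrooms (one IV: classroom).
classroom_a = [14, 12, 15, 13, 16, 14]
classroom_b = [18, 17, 20, 19, 18, 21]
classroom_c = [11, 10, 12, 13, 11, 12]

# Parametric: one-way ANOVA (independent groups, ~normal DV).
print(stats.f_oneway(classroom_a, classroom_b, classroom_c))

# Nonparametric: Kruskal-Wallis (independent groups, non-normal DV).
print(stats.kruskal(classroom_a, classroom_b, classroom_c))

# Friedman test: for repeated measures, where the three lists would be
# repeated observations of the SAME individuals rather than separate groups.
print(stats.friedmanchisquare(classroom_a, classroom_b, classroom_c))
```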
Categorical dependent variables and independent variables

All the tests in the previous section were for continuous data types or, perhaps, ordinal data types with a sufficient number of levels. Chi-squared tests are the common statistical workhorse when your DV is categorical (or ordinal with a few levels), your DV is measured as frequencies or counts, the counts are independent of each other, and you also have categorical or ordinal IVs. The nice thing about chi-squared tests is that they are technically nonparametric, so they are quite flexible, but they do have some assumptions you should be aware of (see McHugh, 2013 for an in-depth discussion). A short sketch follows below.
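Here is a minimal sketch of a chi-squared test of independence; the contingency counts are hypothetical.

```python
from scipy import stats

# Hypothetical counts: rows = two interventions, columns = sessions in
# which challenging behavior did / did not occur.
observed = [[30, 20],   # intervention 1
            [12, 38]]   # intervention 2

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p, dof)
print(expected)  # counts expected if rows and columns were independent
```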
Summary of statistical significance

NHST has been one of the dominant historical approaches to describing intervention effects in Psychology and Education outside behavior analysis. Given how likely it is that you will contact statistical significance at some point in your life, you should be well versed in what NHST means, its drawbacks, and the benefits of significance testing broadly beyond NHST. Though significance tests exist that tell you what you want to know (did my intervention change behavior?), authors of scientific articles most frequently report on NHST, which suffers from the challenges that it only tells you the probability of obtaining your data assuming the null hypothesis is true, that it is close to impossible for the null hypothesis ever to be true, and that it often unnecessarily turns a continuous description of intervention effects into
²⁷ Though reasonable people may disagree, this is arguably the best Price is Right game of all time.
a dichotomous description based on an arbitrary cutoff. Despite these challenges, its prevalence suggests that audiences outside behavior analysis will continue to speak this language. Thus, for the time being, it seems worthwhile to learn to understand and speak this language. The biggies here to know are t-tests, ANOVAs, ANCOVAs, and chi-squared tests; the nonparametric equivalents to each in the Wilcoxon or Mann-Whitney U, Kruskal-Wallis or Friedman, and Puri & Sen or McSweeney & Porter tests, respectively; and when you would need two-way or mixed models to extend the above tests. If you leave this book knowing only these things, then we feel we have done our job. To conclude this section, a de-emphasis around all the tests we just told you about seems to be in order. If NHST is, well, essentially logically unhelpful, what do we do instead? Unfortunately, no single or straightforward alternative currently fills that bill, and there likely won't be one anytime soon (Cohen, 1994; Wasserstein et al., 2019²⁸). Nevertheless, a growing movement in many sciences is in calculating and reporting effect sizes alongside any tests of statistical significance. So, with that in mind, we next focus our lens on the land of effect sizes.
Effect sizes

Behavior analysts often aim to manipulate some variable (our IV) to influence some change in a behavior of interest (our DV). From there, our interest rightly includes not just whether behavior changed and whether that change was a function of the IV, but also in what direction (did things get better or worse?), how long it takes to get there (Two seconds? Two years?), and by how much (A lot? A little?). Furthermore, we are typically interested in our answers to these questions compared to when others have tried to solve a similar problem in the past. Answering these questions isn't unique to behavior analysis. Researchers and practitioners spanning most, if not all, professions are also interested in the direction, speed, and magnitude of their experimental manipulations, and how their results compare to the results of others.²⁹
²⁸ To illustrate, consider that in a special issue of The American Statistician titled "Statistical Inference in the 21st Century: A World Beyond p < 0.05" there were 43 (yes, 43) papers about what "to do" statistically in a world where statistical significance is de-emphasized.
²⁹ Said more eloquently by Cohen (1990), ". . .science is inevitably about magnitudes. . ." (p. 1309).
It's relatively easy to understand why direction, speed, and magnitude of an effect are of interest to researchers and practitioners in most disciplines. We all want to take the medication that cures illness as quickly as possible, we want to enroll in a class for which as close to 100% of the students learn the material to fluency, we update our houses with windows and insulation that produce the largest energy savings relative to our budget, and we all want the downtrodden sports teams we root for to make coaching changes that produce the largest and quickest improvements in relative standings.³⁰,³¹ Obviously, few effects are always immediate or large. But, these examples illustrate the usefulness and ubiquitousness of our understanding of effects as a means to navigate our lives. To reiterate, our analytic interests often don't end at only demonstrating that a functional relation exists between our IVs and DVs. We often are also interested in knowing the magnitude and direction of the effect, how that change compares to other approaches we could have taken, and how to make the effect even larger or occur more quickly. Effect sizes are the metric of choice when conducting these kinds of analyses. Stated simply, an effect size is a quantitative metric that tells us the magnitude of the: (1) change in a variable (our DV) over time, (2) difference between groups (e.g., between control and experimental groups), or (3) relationship between two variables. And, despite the seemingly intuitive and widespread interest in understanding effects in a practical sense, neither David nor Jason recalls spending any time on these topics in the statistics courses we've completed. We assume the same may apply to you, our brave reader.³² Given this, we'll briefly introduce effect sizes, highlight some of the reasons why each is valuable, and present common methods to calculate and interpret effect size estimates.³³
³⁰ The Buffalo Bills hired Sean McDermott as their coach in 2017. The effect was immediate. The Bills made the playoffs in his first year as coach, ending the longest active postseason drought of any major North American sport. Jason says, "Can I get a 'Go Bills!'?"
³¹ David saw Jason's footnote about the Bills so had to wave a Terrible Towel and list the Super Bowl winning years of the greatest franchise in NFL history: 1975, 1976, 1979, 1980, 2006, and 2009.
³² To illustrate this, we challenge you to dust off any research methodology and statistics textbook you have laying around and search the index for "effect size." Anyone locating more than a paragraph of text on effect sizes in such a book wins the opportunity to take a selfie with David and Jason.
³³ Readers looking for more in-depth (read: book-length) treatments on effect sizes are referred to Aberson (2010); Cohen (1988); Cooper, Hedges, et al. (2019); Cumming (2012); Ellis (2010); Grissom and Kim (2005); and Murphy et al. (2012).
Effect sizes provide a means for understanding and communicating about the results of our work from an everyday perspective—here we're talking about putting a number on practical significance. Behavior analysts who have defended a thesis or dissertation have likely been subjected to some version of the question, "Why should I or anyone else care about this?," from a committee member. The intent of such a question isn't to stump the student; rather, it is to have the student articulate the effects and their magnitudes.³⁴ The ability to communicate intervention effects precisely and numerically has many benefits. First, an effect size provides a metric that is separate from the unique measurement system used in any single study. This allows practitioners to look across different studies that use different measures and different methods to more easily compare interventions. For example, consider two independent studies both seeking to change the same behavior in the same direction; however, one study used a 5-point scale and the second used a 7-point scale. Using only raw change scores would make it difficult to compare the effects of these studies directly. To compare them, we have to standardize the outcomes so they are on the same scale. Effect sizes do this for us. A second benefit of effect sizes is that they allow researchers to pool effects across many independent studies. Referred to as meta-analyses, these scholarly works help identify evidence-based practices and the likely magnitude of an effect any one practitioner might expect with any one client, and subsequently can play a much more influential role on policy than any one research article in isolation. For a more thorough discussion of the relevance of meta-analyses for behavior analysts, we encourage readers to check out Dowdy et al. (2021). A third benefit of effect sizes is that they can be leveraged in an a priori fashion to ensure studies are properly powered. Statistical power refers to the likelihood that we detect an effect of our intervention when such an effect actually exists. An insufficiently powered study is
³⁴ Although this represents a departure from what tests of statistical significance aim to achieve, it does represent something many would consider more important. This sentiment is captured by Cohen (1990), who noted, "Next, I have learned and taught that the primary product of a research inquiry is one or more measures of effect size, not p values" (p. 1310). Similarly, Motulsky (2018) noted, ". . .don't let yourself get distracted by P values. Instead, look for a summary of how large the effect was. . ." (p. 139).
at risk for failing to detect actual effects (i.e., Type II error). A priori power analyses allow researchers to identify the likely number of observations they need to detect an effect of a certain magnitude. Chapter 7 takes a deep dive into the topic of power analyses so we won't go further here. But, alas, effect sizes are very handy for these calculations. A thorough treatment of the benefits and drawbacks of all effect sizes that exist is well beyond the scope of this text. However, as with statistical significance above, we'll provide a brief primer on the effect sizes most behavior analysts have likely encountered and when they are likely to be most useful. To start this conversation, effect sizes can be categorized into one of at least three groups—risk estimates, group difference indices, and strength of association indices. The first two are used to estimate differences between groups, and the third is used to estimate the association between variables.
Risk estimates

Risk estimates are used when working with discrete data (see Chapter 2 for a refresher on categories of data) and claims around an effect relate to the probability that an observation will be classified into one out of two or more categories of interest (e.g., success/failure; sick/well; alive/dead; better/worse). Although some authors have noted that risk estimates are predominantly used in medical research and may not be of much use in psychology and education (Ferguson, 2009; Kirk, 1996), we elected to include them here for two reasons. First, their calculations are relatively straightforward (it's nice to ease into things, right?). Second, some behavior analysts work in medical environments and consume medical research, making these highly relevant to their work and the research they read. To work through the calculation and use of risk estimates, consider a situation in which you recently completed all degree requirements of your behavior analysis program. You are beginning to prepare to take the Behavior Analyst Certification Board (BACB) exam and are considering enrolling in an exam preparation program. Given the cost and effort required to complete such a preparation program, you're unsure if the juice is worth the proverbial squeeze. In reviewing the information on the website for a test-prep program called "ExamPass for the Big Bad Scary Test by the Magnificent David & Jason," you note they
provided a table containing pass/fail data for individuals who have enrolled in the program compared to individuals who did not (see Table 5.1).

Table 5.1 Hypothetical data for risk estimate calculations.

Group | Fail | Pass
No Program | A (50) | B (50)
Program | C (10) | D (90)

To better understand how big of an effect their test-prep program had on past consumers of their product, we can calculate common risk estimates such as risk difference (RD), risk ratio (RR), and odds ratio (OR). Because the biggest risk involved in this hypothetical scenario is to fail the exam, that will be our primary focus. The easiest risk estimate calculation is likely risk difference. Calculating RD involves simply subtracting the proportion of individuals who did enroll in the Program and failed the exam from the proportion of individuals who did not enroll in the Program and failed the exam. As an equation using labels from Table 5.1 (Rosnow & Rosenthal, 2003):
$$\text{RD} = \frac{A}{A+B} - \frac{C}{C+D} \tag{5.1}$$
To complete the calculation, we plug the hypothetical numbers of individuals from Table 5.1 into Eq. 5.1 and solve: RD = 50/(50 + 50) − 10/(10 + 90) = 0.4. An RD of 0.4 means that there is a 40% risk difference between the No-Program and Program groups. Stated differently, there is 40% more risk of failing the BACB exam when not enrolling in the Program compared to taking the Program. So, should you give Jason and David your money?³⁵ A second type of risk estimate is relative risk or a risk ratio. This equation is also relatively straightforward and involves the same information as risk difference. However, instead of subtracting one from the other, we divide one by the other. As an equation (Rosnow & Rosenthal, 2003):
A ðA 1 BÞ
i
C ðC 1 DÞ
35
(5.2)
Well, this is awkward given you already have purchased this book. In all sincerity, though, thank you for doing so. We hope you are enjoying reading the book thus far and the money spent was well worth it.
To complete the calculation, we plug the hypothetical numbers from Table 5.1 into Eq. 5.2 and solve: RR = [50/(50 + 50)] / [10/(10 + 90)] = 5. Interpreting RRs is based on the following. An RR = 1 means the risk of the outcome is the same for both groups; an RR < 1 means the risk of the outcome is less in the first group (those who did not take the Program; the numerator group in the equation) compared to the second (those who did take the Program; the denominator group); and an RR > 1 means that the risk of the outcome is less in the second group (i.e., the numerator is bigger than the denominator). In our situation, an RR of 5 means that the risk of failing the BACB exam for those who do not take the Program is 5 times greater than for those who do take the Program.

For those of us who frequently bet on sports, thinking in terms of ORs might be more intuitive. Calculating an odds ratio requires three steps. First, we calculate the odds that one of the outcomes occurred for the No Program group. Second, we calculate the odds that one of the outcomes occurred for the Program group. Lastly, we divide one set of odds by the other to get the ratio of the two odds (Eq. 5.3; Rosnow & Rosenthal, 2003). As an equation:

OR = (A/B) / (C/D)    (5.3)

Again, to complete the calculation, we plug the hypothetical numbers from Table 5.1 into Eq. 5.3 and solve: OR = (50/50) / (10/90) = 9. Interpreting the OR is based on the following. An OR = 1 means there is no difference in odds between the groups; an OR < 1 means the odds of the outcome (failing the BACB exam) are less likely in the first group (those who did not take the Program; the numerator in the equation) compared to the second (those who did take the Program); and an OR > 1 means the odds of the outcome are less likely in the second group (the denominator in the equation). In our example, the OR of 9 means that the odds of failing the BACB exam are nine times higher for those who do not take the Program than for those who do.[36]

[36] Here, we calculated everything relative to failing the exam. You could also calculate the odds of passing and just swap the numerator and denominator for each group.
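Because each estimate comes straight from the four cells of Table 5.1, all three are easy to verify by hand or in a few lines of code. Here is a minimal Python sketch of our own (the cell counts are the hypothetical values from Table 5.1, not real exam data):

```python
# Cell counts from the hypothetical Table 5.1.
a, b = 50, 50  # No Program: fail (A), pass (B)
c, d = 10, 90  # Program: fail (C), pass (D)

risk_difference = a / (a + b) - c / (c + d)   # Eq. 5.1
risk_ratio = (a / (a + b)) / (c / (c + d))    # Eq. 5.2
odds_ratio = (a / b) / (c / d)                # Eq. 5.3

print(risk_difference, risk_ratio, odds_ratio)  # approximately 0.4, 5.0, 9.0
```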
One final issue deserving of space is how we interpret whether a risk estimate translates into a meaningful, practical effect. For example, how meaningful is an OR or RR of 5? Practically, is a 5 not meaningful, small, moderate, or large? Broadly speaking, some have cautioned against "ritualizing" cutoff values in the same manner as has been done with statistical significance (Kirk, 1996). Instead, they argue that risk estimates should be interpreted relative to the context (Ferguson, 2009). Nevertheless, humans can't seem to avoid making rules, and so some have proposed scales for interpreting risk estimates. For example, Sullivan and Feinn (2012) proposed that small, medium, and large effects coincide with ORs of 1.5, 2, and 3 and RRs of 2, 3, and 4, respectively. Assuming these guidelines capture some degree of truth, it sounds like risky business not to enroll in the program titled "ExamPass for the Big Bad Scary Test by the Magnificent David & Jason."
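If you do lean on such benchmarks, they are simple to encode. Here is a hypothetical helper of our own, using the OR benchmarks quoted above; it should be read with the same caution the cited authors urge, since context matters more than any cutoff:

```python
def interpret_odds_ratio(odds_ratio):
    # Sullivan and Feinn's (2012) rough OR benchmarks as quoted in the text.
    # Heuristics only, not rigid cutoffs.
    if odds_ratio >= 3:
        return "large"
    if odds_ratio >= 2:
        return "medium"
    if odds_ratio >= 1.5:
        return "small"
    return "below the 'small' benchmark"

print(interpret_odds_ratio(9))  # 'large'
```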
Group difference indices

When working with continuous variables (again, see Chapter 2 for a refresher; e.g., height, weight, temperature, IQ, duration, latency), estimates of intervention effects are typically evaluated by calculating differences between measures of central tendency (Chapter 3) while considering variability within and between groups (Chapter 4). Given how popular the arithmetic mean is in life, a popular group difference effect size measure is the standardized mean difference between groups. This metric involves subtracting the mean of one group from the mean of a second group (this is what is meant by "mean difference") and dividing by the standard deviation (this is what is meant by "standardized"; for a standard deviation refresher, refer back to Chapter 4). Simple, right? Well, as with seemingly everything in life, it turns out there are many different ways to compare means relative to standard deviations. Commonly published options in behavior analysis include Cohen's d (Eq. 5.4), Glass's Δ (Eq. 5.5), and Hedges' g (Eq. 5.6). Across equations, M denotes the arithmetic mean and SD denotes the standard deviation. Here they are in all their glory:

Cohen's d = (M_Experimental - M_Control) / SD_Population or Pooled    (5.4)

Glass's Δ = (M_Experimental - M_Control) / SD_Control    (5.5)
Hedges' g = (M_Experimental - M_Control) / SD_Pooled with sample correction    (5.6)
Notice that the numerator involves the same calculation across all three measures: the difference between the means of the experimental and control groups. The denominator, in contrast, uses some measure of standard deviation, with the specific variant differing across the three measures. Cohen's d uses the standard deviation of the population. This sounds straightforward: simply compare the difference in means across datasets relative to the variability we would expect had we collected all possible data that exist on the topic. This is all fine and well, except for one slight issue. In many (if not all) situations, the population standard deviation isn't known. WTF! Thanks for nothing, Cohen. Only kidding, of course. Thankfully, those who love mathematical proofs figured out that the pooled standard deviation works well enough.[37] To the extent that variability is similar across both datasets, this is a pretty logical fix for our inability to collect data on an entire population of people or on the population of responses for one individual. However, this may not always be the case. Some interventions might increase or reduce variability relative to the control condition. Thankfully, some very smart people figured out some very smart ways of dealing with this issue, which leads to the next option in Eq. 5.5.

The second option is to calculate Glass's Δ (Eq. 5.5). Here, the assumption is that our experimental condition is. . .well. . .experimental. Thus we have no clue what the standard deviation of this group should look like. To play it safe, we lean into what we are more certain about and simply use the control group standard deviation as our measure of variability.[38] Easy enough, right!? One last fun wrinkle about Cohen's d and Glass's Δ is that they are appropriate when the number of observations (i.e., the sample sizes) is identical across groups. But, sometimes in life, we have more participants in one group compared to another, or we have more responses in one condition (e.g., the intervention condition(s)) compared to another (e.g., the baseline condition(s)). This gets us to our third option.

When sample sizes differ, Hedges' g is our best option for calculating mean difference scores. The logic and proofs are well beyond the scope of this book; however, Hedges (1981) showed that pooled estimates of standard deviations are biased when the sample sizes differ across the datasets being compared. In the same article, Hedges showed one method for correcting this biased estimate by "weighting" each standard deviation by the group's respective sample size. The resulting effect size is referred to as Hedges' g (Eq. 5.6) and, fortunately for us, many statistical software packages do this work for us.

With the calculations of group difference indices out of the way, the next reasonable issue to tackle is how to interpret the number we get following the calculation. Fortunately, this has been simplified for us in two ways. First, the above indices express effect sizes in standard deviation units (again, see Chapter 4 for the nitty-gritty on standard deviation). This is useful because we now have a standardized way to interpret effect size that will mean roughly the same thing across all studies that use the same effect size. For example, an effect size of 0.25, 1.0, or 2.5 (or whatever value is achieved) means that the difference between groups (e.g., experimental and control groups) is 0.25, 1, or 2.5 standard deviations, respectively. This is quite beautiful because the measure is no longer tied to any specific measurement scale. Two studies that both report an effect size of d = 1.0 tell us immediately that the magnitude of the effect in both studies is the same once the measurement system is controlled for. Similarly, if two studies reported effect sizes of d = 1.5 and d = 0.5, we could readily conclude that the effect in the first study was three times as large as in the second, regardless of what was measured. The second reason that interpreting mean difference indices has been simplified is that Cohen (1969), in his pioneering work on effect sizes, outlined guidelines for interpreting the derived values that are generally well accepted. Specifically, effect sizes of 0.2, 0.5, and 0.8 indicate small, medium, and large effects, respectively.

[37] By pooled, we just mean that the standard deviations from the two groups are combined into a single measure of variability. For the curious reader, the formula for calculating the pooled standard deviation for two groups is SD_pooled = sqrt((SD_1^2 + SD_2^2) / (n_1 + n_2)). In words, square the standard deviation from group 1 and add that to the squared standard deviation from group 2. Divide that by the total sample size across both groups (some methods also will subtract the number of groups here). Lastly, take the square root of the value calculated in the last step.
[38] For a full and more technical explanation of the logic behind using the control group, we direct the curious reader to Glass et al. (1981).
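In practice, statistical software computes these indices for you, but the formulas are small enough to sketch directly. Here is a minimal Python sketch of our own; the score lists are hypothetical, and the pooled-SD and small-sample correction shown are common variants rather than the only ones in use:

```python
import math
import statistics

def pooled_sd(x, y):
    # Pooled SD (see footnote 37); here weighted by each group's degrees of
    # freedom, one of the common variants the footnote alludes to.
    dfx, dfy = len(x) - 1, len(y) - 1
    return math.sqrt((dfx * statistics.variance(x) + dfy * statistics.variance(y)) / (dfx + dfy))

def cohens_d(experimental, control):
    # Eq. 5.4 with the pooled SD standing in for the population SD.
    return (statistics.mean(experimental) - statistics.mean(control)) / pooled_sd(experimental, control)

def glass_delta(experimental, control):
    # Eq. 5.5: uses only the control group's SD.
    return (statistics.mean(experimental) - statistics.mean(control)) / statistics.stdev(control)

def hedges_g(experimental, control):
    # Eq. 5.6: Cohen's d scaled by Hedges' (1981) small-sample correction.
    n = len(experimental) + len(control)
    return cohens_d(experimental, control) * (1 - 3 / (4 * n - 9))

experimental = [82, 90, 77, 88, 95]  # hypothetical posttest scores
control = [70, 75, 68, 80, 72]
print(cohens_d(experimental, control), glass_delta(experimental, control), hedges_g(experimental, control))
```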
That said, to echo the sentiment expressed above for interpreting risk estimates, Cohen and others have cautioned against rigid adherence to such guidelines. Instead, interpreting effect sizes should consider a number of factors[39] such as the context, the limitations of a study, the degree of internal validity demonstrated, and effect sizes from past researchers (Ferguson, 2009).[40]

To round out our discussion of group difference indices, we'll use an example with which we bet (OR, anyone?) that behavior analytic readers will easily connect. Fisher et al. (2020) conducted a randomized controlled trial to evaluate the influence of a virtually delivered training program on parent implementation of a number of skills related to work and play activities with their children with ASD. To do so, the experimenters recruited a sample of participants, conducted pretests with all participants, randomly assigned participants to either the treatment or waitlist control group, delivered the intervention (eLearning modules delivered virtually) to the treatment group,[41] and conducted posttests with all participants. Thereafter, the researchers calculated descriptive and inferential statistics to compare the groups on the dependent variables (the percentage of opportunities in which target skills were implemented correctly and the percentage of skills mastered). All p values were < .0005, indicating "statistical significance." But the researchers didn't stop there, also deriving effect sizes (d) that were all equal to or greater than 3.0, meaning the treatment group outperformed the waitlist control by at least 3 standard deviations (an effect considered large according to the Cohen guidelines)!

[39] A full treatment of such considerations is outside our scope here, but we direct curious readers to Chapter 2 in Ellis (2010).
[40] Although not included in our discussion, it is best practice to include confidence intervals when reporting effect sizes. Confidence intervals communicate a range of values likely to include the true effect size in the population; said differently, confidence intervals provide information about the precision of our estimated effect size, with small confidence intervals communicating greater precision compared to larger ones. For example, if we derive an effect size of d = 1.0 with a 95% confidence interval of 0.5-1.5, the interpretation is that intervals constructed this way would capture the true population effect size 95% of the time.
[41] Participants in the waitlist control group were placed on a waitlist and were given the option to participate in the training (either virtually or onsite) following the posttest; see Fisher et al. (2020) for additional details.

Strength of association indices

To close out this section, we'll discuss what we imagine will be a very familiar effect size metric for behavior analysts: strength of association indices. These indices are used when we're interested in the
association between two or more dichotomous or continuous variables.[42][43] For the sake of clarity, it might be helpful to begin by considering what is meant when we speak of an association. In a statistical sense, we are asking about the degree to which a variable is related to, or tends to vary with (the technical term for this is covariation), another variable. For an intuitive example, typically the taller a person is, the more they tend to weigh. That is, as height increases, weight increases as well. We can visualize this relationship by plotting each person's combination of height and weight in a scatterplot. Here, the X values might represent height and are plotted along the abscissa (horizontal axis), and the Y values might represent weight and are plotted along the ordinate (vertical axis). Each data point on the plot would represent the height and weight measurements of a unique person.

[42] We wish to point out that association indices provide flexibility in that we can evaluate the relationship between variables measured using different types of data (i.e., ordinal and nominal; see Chapter 2) and different measurement values (e.g., wealth measured in dollars and health measured in physician visits; Abbott, 2017).
[43] Note, risk estimates and group difference indices evaluate differences (e.g., the degree to which outcomes differ across control and intervention groups), whereas strength of association indices do not. However, it is common practice to discuss strength of association measures when discussing effect sizes because ". . .the meaning of strength of association may be viewed as identical to the broader meaning of effect size as 'the size of relation between two variables' (Hedges & Cooper, 1981, p. 534)" (Rosenthal, 1996, p. 40).

As you likely recognized, questions regarding the association/relatedness/covariation among two or more variables are commonly expressed quantitatively as some variation of a correlation coefficient. Though most people use "correlation" interchangeably with one type of correlation (Pearson's r), it turns out that there are many types of correlations you can calculate, where the method chosen (probably no surprise here) depends on the data types you have and the form of the relation between the two variables. We already reviewed data types extensively in Chapter 2, so we won't rehash that here. So let's dive into the missing pieces: the three features of association between variables and how they impact which correlation metric we choose. These three features are direction, form, and strength.

Let's talk direction. A correlation coefficient can take any value from -1 (a perfectly negative relationship) to +1 (a perfectly positive relationship). A value of 0 indicates there is no consistent relation between the two variables. You can identify the direction of the relationship based on the sign of the coefficient, positive (+) or negative (-). A positive correlation
Figure 5.2 Examples of what different correlation coefficients look like (i.e., their form) when plotted in two dimensions.
coefficient indicates that the variables tend to be observed at similar relative levels (upper left panel, Fig. 5.2). That is, as measures of one variable increase, measures of the second variable also increase. For example, we might derive a positive correlation between height and weight because as height increases, weight also tends to increase; and vice versa, as height decreases, weight also tends to decrease. A negative correlation coefficient indicates that the variables tend to be observed at opposite relative levels: as one variable increases, the second variable decreases (upper right panel, Fig. 5.2). Negative correlations are sometimes referred to as inverse relationships.

Let's talk form. In the context of correlations, form is derived by evaluating the shape created when our data are plotted as a scatterplot.
124
Statistics for Applied Behavior Analysis Practitioners and Researchers
Table 5.2 Common correlation metrics and data types for which they are most appropriate.

Name                  Data type and form most appropriate
Pearson's r           Two continuous variables that are related linearly
Spearman's ρ (rho)    Two continuous variables that are related nonlinear-monotonically
Kendall's τ (tau)     Two continuous variables that are related nonlinear-monotonically
Point biserial        One variable is continuous and one variable is dichotomous (aka binary)
φ (phi)               Two dichotomous (aka binary) variables
Cramer's V            Two categorical variables with any number of levels
The three most common general forms are linear, nonlinear-monotonic, and nonlinear-non-monotonic. A linear correlation is one in which the relationship between the variables generally looks like a line when plotted as a scatterplot (top panels, Fig. 5.2). That is, the data move in the same direction (positive or negative) and do so at a constant rate. A nonlinear-monotonic relationship occurs when we plot our data and the form does not follow a line (it is nonlinear), but the data never reverse direction along either the X or Y axis (lower left panel, Fig. 5.2). A final common pattern occurs when we plot our data in a scatterplot, the shape is nonlinear, and the data do reverse direction along the X or Y axis (lower right panel, Fig. 5.2). This is the exciting nonlinear-non-monotonic relationship.

Let's talk strength. The strength of a correlation is expressed as a numerical value between -1 and +1, with values closer to -1 or +1 indicating a stronger relationship than values closer to 0. You might be wondering, "How far from zero do we have to get for it to count as a meaningful association between the variables?" and "When is the 'strength' of the relationship worth writing home (or a manuscript) about?" As with the rest of the statistics discussed in this chapter, several different sets of guideposts have been offered. For example, Rosenthal (1996) suggested that coefficients equal to or greater than 0.1 (or -0.1), 0.3 (or -0.3), 0.5 (or -0.5), and 0.7 (or -0.7) represent weak, moderate, strong, and very strong relationships, respectively. Alternatively, Ferguson (2009) suggested that 0.2 (or -0.2), 0.5 (or -0.5), and 0.8 (or -0.8) represent a minimum, a moderate, and a strong effect, respectively.[44]

[44] We would be remiss not to mention the age-old adage that correlation does not equal causation. As such, our interpretation of a correlation must consider that the relationship might be spurious (again, see Vigen, 2015 for some fun examples of spurious correlations).
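For readers who want to see these coefficients in action before choosing among the metrics in Table 5.2, here is a minimal sketch, assuming SciPy is installed; the height and weight values are hypothetical:

```python
from scipy import stats

height = [60, 62, 65, 68, 70, 72, 75]         # hypothetical heights (inches)
weight = [115, 120, 140, 155, 160, 175, 190]  # hypothetical weights (pounds)

r, _ = stats.pearsonr(height, weight)      # linear association
rho, _ = stats.spearmanr(height, weight)   # monotonic (rank-based) association
tau, _ = stats.kendalltau(height, weight)  # monotonic (pairwise) association
print(r, rho, tau)  # all near +1 for these data
```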
Nevertheless, on par with the rest of this chapter, the larger context in which the study was conducted and past research should determine how impressive the findings are from any one study.

With the features of correlations out of the way, we can now get to the common correlation coefficients[45] and their most appropriate use cases (Table 5.2). These include Pearson's r, which is appropriate for two continuous variables that are linearly related; Spearman's rho (ρ), which is appropriate for two continuous variables that are nonlinear-monotonically related; Kendall's tau (τ), which is appropriate for two continuous variables that are nonlinear-monotonically related;[46] the point biserial correlation (r_pb), which is appropriate when one variable is dichotomous (aka binary) and the second variable is continuous; phi (φ), which is appropriate when both variables are dichotomous (aka binary)[47] and can be arranged in a 2 × 2 contingency table (similar to Table 5.1); and Cramer's V, which is appropriate for nominal (aka categorical) variables that can be arranged in a contingency table of any size.[48]

[45] We elected not to include the relevant formulas or the assumptions that must be met for using these indices given the availability of statistical software (e.g., SPSS) for these tasks. Also, many statistical textbooks, a quick Google search, or even prompting ChatGPT (https://openai.com/blog/chatgpt/) will return the details regarding the assumptions that need to be met and the corresponding calculations.
[46] Kendall's tau is an alternative to Spearman's rho that tends to have better error sensitivity when the data are normally distributed.
[47] The astute reader may recall that when working with such variables, calculating a risk estimate is also feasible (Fleiss, 1996). See Rosenthal (1996) for an illuminating example of why this is the case.
[48] You may have noticed we didn't review nonlinear-non-monotonically related variables. If you have found yourself in this situation, you're in a different area in this wonderful land of modeling. We get to this in Chapter 6.

But, how about single-case experimental designs?!

There's an elephant in the room. Earlier, we framed effect sizes for data collected in the context of between-group experimental designs. You might say that's all fine and well, but not particularly relevant for behavior analysts[49] as we primarily use single-case experimental designs (SCED). Obviously, the use of SCED doesn't preclude an interest in effects and their magnitudes. As noted by Baer et al. (1968), "If the application of behavioral techniques does not produce large enough effects for practical value then the application has failed" (p. 96). Historically, behavior analysts using SCED have made judgments
of effects and their magnitudes through visual analysis. Visual analysis does have utility as a "quick'n'dirty" approach to data analysis. "Quick" because no software programs or sophisticated analytic methods are typically necessary; we need only our capacity to "see"[50] the data and to consider the surrounding context.[51] And "dirty" because visual analysis isn't always reliable across behavior analysts or ultimately acceptable outside of behavior analysis (as discussed further in Chapter 8). A handful of statistical effect size indices have been developed for use with SCED, but these are often muddled by the presence of time when analyzing data, which we address directly in Chapter 8. However, if you simply can't wait until Chapter 8, we recommend you check out Dowdy et al. (2021), Moeyaert et al. (2018), and Pustejovsky and Ferron (2017) for in-depth treatments of indices relevant to SCED.

[49] The rationales for the historical behavior analytic disregard for such designs and their outcomes have been well described elsewhere (e.g., Branch & Pennypacker, 2013).
[50] This is in quotes because this doesn't always require vision. See Johnson (2016) for a creative way that one blind astrophysicist uses audio to analyze data. Auditory analysis as an alternative to visual analysis, anyone?
[51] Recall our previous discussion that interpretations of effects include a consideration of relevant context, astutely noted by Baer et al. (1968): "In evaluating whether a given application has produced enough of a behavioral change to deserve the label, a pertinent question can be, 'How much did that behavior need to be changed?' Obviously, that is not a scientific question, but a practical one" (p. 96).
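To give a flavor of what such indices look like, here is a minimal sketch of our own of the nonoverlap of all pairs (NAP), one nonoverlap-based SCED effect size treated in the sources above; the baseline and treatment values are hypothetical session data:

```python
def nap(baseline, treatment, larger_is_better=True):
    # Nonoverlap of All Pairs: the proportion of all baseline-treatment
    # data-point pairs in which the treatment point shows improvement,
    # with ties counted as half an improvement.
    better = ties = 0
    for b in baseline:
        for t in treatment:
            if t == b:
                ties += 1
            elif (t > b) == larger_is_better:
                better += 1
    return (better + 0.5 * ties) / (len(baseline) * len(treatment))

# Hypothetical counts of correct responses per session.
print(nap([4, 5, 3, 6], [7, 8, 6, 9]))  # 0.96875
# For behavior targeted for reduction, set larger_is_better=False.
```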
Social significance

No discussion of intervention effectiveness would be complete without considering social significance,[52] at least by applied[53] behavior analytic standards.[54] This is highlighted by Baer et al.'s (1987) comment that "Almost every successful study of behavior change ought to routinely present two outcomes—a measure of the changed target behaviors, of course, and a measure of the problem displays and explanations that have stopped or diminished in consequence" (p. 322). The former
measure seemingly refers to indices of effect as we have discussed in the preceding section; the latter seemingly refers to evaluations of social significance. Considering both is necessary because if an intervention is ". . .socially invalid, it can hardly be effective, even if it changes its target behaviors thoroughly and with an otherwise excellent cost-benefit ratio; the social validity is not sufficient for effectiveness but is necessary to effectiveness" (Baer et al., 1987, p. 323).[55] Relatedly, even if there were all the love for an intervention in the world, the placebo effect (Benedetti, 2009) highlights how social significance in isolation from a robust effect size is not indicative of an intervention worth pursuing.

An example may help drive this point home. Consider a situation in which a researcher develops a new intervention with the aim of decreasing the frequency of self-injurious behavior (SIB). The researcher may find statistical significance and a large effect size, but social significance may be unacceptable if SIB decreased from 75 to 20 instances per hour.[56] The implication here isn't that such outcomes aren't meaningful or that the intervention should be discarded, but that considering social significance is also a necessary component when interpreting intervention effects. Of course, you are likely no stranger to the relevance of social validity when you recommend and select interventions.[57] As such, we elected to provide only a brief discussion of this topic given the training you have likely received on it and the availability of many relevant behavior analytic writings.

[55] For an eye-opening and mind-blowing example of the importance of both, look no further than Project Follow Through; we refer inquiring minds to Engelmann (2007) and Watkins (1997) for all the juicy details.
[56] To give you another way of thinking of the necessity of social significance, consider that ". . .plenty of true things aren't important. 'People crash their cars more often when you blindfold them.' 'People have a hard time sleeping when you play Insane Clown Posse really loud.' 'People like receiving $5 more than they like getting kicked in the head.' I'm sure all of these hypotheses could score you big honkin' effect sizes. All of them are true. None of them are important." (Mastroianni, 2022).
[57] After all, the necessity of social validity has been alluded to or directly addressed since early in the development of the applied branch of our field (Kazdin, 1977; Wolf, 1978), its role has been clarified as it relates to the defining characteristics of applied behavior analysis (Baer et al., 1968), and it is mentioned in the latest iteration of the Behavior Analyst Certification Board's Task List (2017): "Recommend intervention goals and strategies based on such factors as client preferences, supporting environments, risk, constraints, and social validity" (p. 5).
[52] Social significance is also referred to as social validity (Kazdin, 1977; Wolf, 1978) or a therapeutic criterion (Risley, 1970).
[53] Although we attempted to orient this book to be of interest to a wide variety of behavior analysts and settings, our use of the "applied" qualifier was intentional here. An interest in functional relations is apparent across the continuum of behavior analysis, but this interest is often pursued with special consideration for social significance in the applied branch. This is highlighted by Baer et al.'s (1968) specification that ". . .the behavior, stimuli, and/or organism under study are chosen because of their importance to man and society, rather than their importance to theory" (p. 92).
[54] An interest in social significance is also apparent in fields outside of behavior analysis. For example, Horner et al. (2005) included evaluation of social validity as an SCED quality indicator in special education.

So how do we measure social significance? Social significance has historically been described (Wolf, 1978) along three areas, each of which presents
an opportunity for measurement. These three areas are the behavior that is the focus of the intervention (the goals), the intervention itself (the procedures), and the outcomes of the intervention (the effects). Historically, evaluating social validity has occurred via either subjective evaluation or social comparison. Subjective evaluation just means soliciting opinions, thoughts, and feedback from consumers, usually through questionnaires, surveys, or interviews. In contrast, using social comparisons to measure social validity means collecting normative data from peers to use as a benchmark against which to compare the outcomes achieved for consumers.

Social validity has arguably been the central focus of applied behavior analysis (ABA) and a hallmark differentiator of ABA from the experimental analysis of behavior.[58] Because of the importance of social validity to ABA, many authors have thoroughly described and discussed social validity measures such that any discussion on our part would be repetitive with these sources (e.g., Carter & Wheeler, 2019; Cooper, Heron, et al., 2019; Fuqua & Schwade, 1986; Kazdin, 2020, among others). Further, researchers have more recently begun to integrate the recipients of behavior-change programs into evaluations (see Hanley, 2010), to leverage qualitative methods (see Burney et al., 2023), and to more rigorously use maintenance measures (e.g., Sigurdsson & Austin, 2006) to evaluate social significance. On par with the rest of this chapter, there likely is no perfect social significance measure and no cutoff scores that are appropriate for all behavior analysts, everywhere, and under all circumstances.

Ironically, despite the purported centrality of social validity to "doing ABA," a relatively small percentage of behavior analytic publications actually report it (approximately 12%-20%; Carr et al., 1999; Ferguson et al., 2019; Kennedy, 1992). One reason for this might be that the construct of social validity, and how it has been evaluated, has predominantly been subjective.[59][60] Although subjective data may not be reliable or always valid (which can hold for objective data, too), surely that shouldn't preclude pursuits to identify best-practice approaches to measuring and evaluating social significance. We look forward to people smarter than us figuring out how to systematically measure social validity in a quantitative way such that future editions of this book can include summaries of the contexts and data types for which different social validity measures are most appropriate.

[58] We're not saying that EAB does not have social validity or that EAB researchers do not care about social validity. It often just isn't the primary focus or benchmark for evaluating the utility of EAB research.
[59] Wolf (1978) pointed this out when he noted, "Unfortunately, that sounded slightly subjective to me. And subjective criteria have not been very respectable in our field" (p. 203).
[60] A more thorough consideration of potentially relevant variables for the low levels of social validity reporting in behavior analytic publications is outside of our scope here, but we direct readers to Carr et al. (1999) and Ferguson et al. (2019) for related discussions.
Chapter summary

This chapter discussed the three lenses through which intervention outcomes and effects are often evaluated: statistical significance, effect size, and social significance. When evaluating statistical significance, the most common method involves formulating a null hypothesis, selecting a significance level, and comparing an obtained p value to the specified significance level. If the p value is less than the significance level, the null hypothesis is rejected and the results are tacted as "statistically significant." More recently, consensus seems to be that best practice is to report the exact p value rather than reporting whether or not a p value met an arbitrary cutoff. Lastly, we continue to emphasize that p values from NHST don't tell you about the reliability or generality of outcomes or effects, nor about their practical importance.

Effect sizes provide a quantitative measure of the magnitude of a difference between two or more datasets, or of the relationship between two or more variables within a dataset. Three broad categories of effect sizes are risk estimates, group difference indices, and strength of association indices. The specific effect size measure you choose will be influenced by the type of data you're working with and the question you are trying to answer. Lastly, interpreting effect sizes will require you to review existing qualitative guidelines specific to the chosen effect size measure, the context and goals of the study, and effect sizes that past behavior analysts have obtained using similar interventions and with similar clients.

Last, but certainly not least, behavior analysts have always placed the most importance on considering social significance when evaluating outcomes and effects. Past authors have emphasized measuring social significance around the goals of an intervention, the procedures used within an intervention, and the overall size of behavior change. As with the above, the larger context and past findings will influence how we
interpret any measures of social significance we collect. Ironically, this area is also the area most ripe for creative and inventive research around robust, consistent, reliable, and valid quantitative measurement of social validity.
References

Abbott, M. L. (2017). Using statistics in the social and health sciences with SPSS® and Excel®. John Wiley & Sons, Inc.
Aberson, C. L. (2010). Applied power analysis for the behavioral sciences. Routledge.
Avicenna (Ibn Sīnā). (1892). Al-Išārāt wa-t-tanbīhāt (J. Forget, Ed.). Brill. (Original work completed 1037.)
Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1(1), 91–97. Available from https://doi.org/10.1901/jaba.1968.1-91.
Baer, D. M., Wolf, M. M., & Risley, T. R. (1987). Some still-current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 20(4), 313–327.
Behavior Analyst Certification Board. (2017). BCBA task list (5th ed.). https://www.bacb.com/wp-content/bcba-task-list-5th-ed.
Benedetti, F. (2009). Placebo effects: Understanding the mechanisms in health and disease. Oxford University Press. ISBN: 978-0-19-884317-7.
Berkson, J. (1938). Some difficulties in interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33(203), 526–536. Available from https://doi.org/10.1080/01621459.1938.10502329.
Branch, M. N., & Pennypacker, H. S. (2013). Generality and generalization of research findings. In G. J. Madden, W. V. Dube, T. D. Hackenberg, G. P. Hanley, & K. A. Lattal (Eds.), APA handbook of behavior analysis: Vol. 1. Methods and principles (pp. 151–175). American Psychological Association.
Burney, V., Arnold-Saritepe, A., & McCann, C. M. (2023). Rethinking the place of qualitative methods in behavior analysis. Perspectives on Behavior Science. Advance online publication. Available from https://doi.org/10.1007/s40614-022-00362-x.
Carr, J. E., Austin, J. L., Britton, L. N., Kellum, K. K., & Bailey, J. S. (1999). An assessment of social validity trends in applied behavior analysis. Behavioral Interventions, 14(4), 223–231. Available from https://doi.org/10.1002/(SICI)1099-078X(199910/12)14:4<223::AID-BIN37>3.0.CO;2-Y.
Carter, S. L., & Wheeler, J. J. (2019). The social validity manual: Subjective evaluation of interventions (2nd ed.). Academic Press.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312. Available from https://doi.org/10.1037/0003-066X.45.12.1304.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. Available from https://doi.org/10.1037/0003-066X.49.12.997.
Cooper, H., Hedges, L. V., & Valentine, J. C. (2019). The handbook of research synthesis and meta-analysis (3rd ed.). Russell Sage Foundation.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2019). Applied behavior analysis (3rd ed.). Pearson.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
Dowdy, A., Peltier, C., Tincani, M., Schneider, W. J., Hantula, D. A., & Travers, J. C. (2021). Meta-analysis and effect sizes in applied behavior analysis: A review and discussion. Journal of Applied Behavior Analysis, 54(4), 1317–1340. Available from https://doi.org/10.1002/jaba.862.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
Engelmann, S. (2007). Teaching needy kids in our backward system: 42 years of trying. ADI Press.
Ferguson, C. J. (2009). An effect size primer: A guide for clinicians and researchers. Professional Psychology: Research and Practice, 40(5), 532–538. Available from https://doi.org/10.1037/a0015808.
Ferguson, J. L., Cihon, J. H., Leaf, J. B., Van Meter, S. M., McEachin, J., & Leaf, R. (2019). Assessment of social validity trends in the Journal of Applied Behavior Analysis. European Journal of Behavior Analysis, 20(1), 146–157. Available from https://doi.org/10.1080/15021149.2018.1534771.
Fisher, W. W., Luczynski, K. C., Blowers, A. P., Vosters, M. E., Pisman, M. D., Craig, A. R., Hood, S. A., Machado, M. A., Lesser, A. D., & Piazza, C. C. (2020). A randomized clinical trial of a virtual-training program for teaching applied-behavior-analysis skills to parents of children with autism spectrum disorder. Journal of Applied Behavior Analysis, 53(4), 1856–1875. Available from https://doi.org/10.1002/jaba.778.
Fleiss, J. L. (1996). Measures of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 245–260). Russell Sage Foundation.
Fuqua, R. W., & Schwade, J. (1986). Social validation of applied behavioral research. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 265–292). Springer. Available from https://doi.org/10.1007/978-1-4684-8786-2_12.
Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Sage Publications.
Grissom, R. J., & Kim, J. J. (2005). Effect sizes for research: A broad practical approach. Lawrence Erlbaum Associates.
Hanley, G. P. (2010). Toward effective and preferred programming: A case for the objective measurement of social validity with recipients of behavior-change programs. Behavior Analysis in Practice, 3(1), 13–21. Available from https://doi.org/10.1007/BF03391754.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128. Available from https://doi.org/10.2307/1164588.
Horner, R. H., Carr, E. G., Halle, J., McGee, G., Odom, S., & Wolery, M. (2005). The use of single-subject research to identify evidence-based practice in special education. Exceptional Children, 71(2), 165–179. Available from https://doi.org/10.1177/001440290507100203.
Johnson, L. (2016, February 18). Blind astrophysicist listens to stars by turning data into sound. Retrieved from https://www.cbc.ca/news/canada/british-columbia/star-sounds-wanda-diaz-merced-ted-1.3452236.
Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change through social validation. Behavior Modification, 1(4), 427–452. Available from https://doi.org/10.1177/014544557714001.
Kazdin, A. E. (2020). Single-case research designs: Methods for clinical and applied settings (3rd ed.). Oxford University Press.
Kennedy, C. H. (1992). Trends in the measurement of social validity. The Behavior Analyst, 15(2), 147–156. Available from https://doi.org/10.1007/BF03392597.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56(5), 746–759. Available from https://doi.org/10.1177/00131644960560050.
Lovaas, O. I. (1987). Behavioral treatment and normal educational and intellectual functioning in young autistic children. Journal of Consulting and Clinical Psychology, 55(1), 3–9. Available from https://doi.org/10.1037/0022-006x.55.1.3.
Mastroianni, A. (2022, September 20). Psychology might be a big stinkin' load of hogwash and that's just fine. https://experimentalhistory.substack.com/p/psychology-might-be-a-big-stinkin
McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143–149. Available from https://doi.org/10.11613/bm.2013.018.
Moeyaert, M., Zimmerman, K. N., & Ledford, J. R. (2018). Synthesis and meta-analysis of single case research. In J. R. Ledford & D. L. Gast (Eds.), Single case research methodology: Applications in special education and behavioral sciences (3rd ed., pp. 393–416). Routledge.
Molella, A. (2012, February 8). How the phrase 'the best thing since sliced bread' originated. The Atlantic. Retrieved from https://www.theatlantic.com/health/archive/2012/02/how-the-phrase-the-best-thing-since-sliced-bread-originated/252674/.
Motulsky, H. (2018). Intuitive biostatistics: A nonmathematical guide to statistical thinking (4th ed.). Oxford University Press.
Murphy, K., Myors, B., & Wolach, A. (2012). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Routledge.
Pustejovsky, J. E., & Ferron, J. M. (2017). Research synthesis and meta-analysis of single-case designs. In J. M. Kaufman, D. P. Hallahan, & P. C. Pullen (Eds.), Handbook of special education (2nd ed., pp. 168–186). Routledge.
Risley, T. R. (1970). Behavior modification: An experimental-therapeutic endeavor. In L. A. Hamerlynck, P. O. Davidson, & L. E. Acker (Eds.), Behavior modification and ideal mental health services (pp. 103–127). University of Alberta Press.
Rosenthal, J. A. (1996). Qualitative descriptors of strength of association and effect size. Journal of Social Service Research, 21(4), 37–59. Available from https://doi.org/10.1300/J079v21n04_02.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44(10), 1276–1284. Available from https://doi.org/10.1037/0003-066X.44.10.1276.
Rosnow, R. L., & Rosenthal, R. (2003). Effect sizes for experimenting psychologists. Canadian Journal of Experimental Psychology, 57(3), 221–237. Available from https://doi.org/10.1037/h0087427.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173–1181. Available from https://doi.org/10.1037/0003-066X.47.10.1173.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Authors Cooperative, Inc.
Sigurdsson, S. O., & Austin, J. (2006). Institutionalization and response maintenance in organizational behavior management. Journal of Organizational Behavior Management, 26(4), 41–77. Available from https://doi.org/10.1300/J075v26n04_03.
Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279–282. Available from https://doi.org/10.4300/JGME-D-12-00156.1.
Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116. Available from https://doi.org/10.1214/ss/1177011945.
Vigen, T. (2015). Spurious correlations: Correlation does not equal causation. Hachette Books.
Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond "p < .05". The American Statistician, 73(sup1), 1–19. Available from https://doi.org/10.1080/00031305.2019.1583913.
Watkins, C. L. (1997). Project Follow Through: A case study of contingencies influencing instructional practices of the educational establishment. Cambridge Center for Behavioral Studies.
White, G. W., Mathews, M., & Fawcett, S. B. (1989). Reducing risk of pressure sores: Effects of watch prompts and alarm avoidance on wheelchair push-ups. Journal of Applied Behavior Analysis, 22(3), 287–295. Available from https://doi.org/10.1901/jaba.1989.22-287.
Wolf, M. M. (1978). Social validity: The case for subjective measurement or how applied behavior analysis is finding its heart. Journal of Applied Behavior Analysis, 11(2), 203–214. Available from https://doi.org/10.1901/jaba.1978.11-203.
CHAPTER 6

Oh, shoot! I forgot about that! Estimating the influence of uncontrolled variables

Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
Alan Perlis
Introduction

Description, prediction, and control are often touted as the hallmark levels of scientific endeavors (e.g., Cooper et al., 2020; Moore, 2010). When researchers first begin to study some phenomenon, they often start by describing what exactly it is they observe and the relations between those observations. Once they have gathered enough observations that the relations between them become reliable, they can begin to predict when, and under what conditions, certain events will occur. But, as the often-noted phrase "correlation does not equal causation" reminds us, it is possible to predict that something will occur with tolerable accuracy or precision without knowing whether one event truly causes (plays a controlling role in) the presence or degree to which another event occurs.

To this point in the book, we have primarily focused on various ways of describing our observations via the set of verbal behaviors called statistics. In these realms, we learned about the unique and precise impact that numerical descriptions of our observations can have on listeners. We also discussed measures (descriptions) of central tendency to efficiently and precisely communicate general patterns of behavior-environment relations, as well as measures (descriptions) of variability within our datasets to efficiently and precisely communicate how variable behavior-environment relations might be for a client or participant. Finally, in the last chapter, we discussed how behavior analysts can describe the effect an intervention has on behavior and
how behavior analysts can use that information to predict behavior change following the same intervention for one of their clients or participants. In this chapter, we turn our attention to some of the complexity of human behavior and that third level of science—control.
Situating this chapter in the broader analytic landscape

What does it mean to control something? The function of the verbal stimulus "control" likely differs for different people based on their learned history. But, at least within the behavior science community, "control" has been used in at least three ways (Table 6.1).

Table 6.1 Three uses of "control" in behavior science.

Theoretical
  Description: Generalized description of behavioral processes; a position is taken on the total set of variables that can play a causal role in a behavior occurring.
  Degree of precision: Low. Typically vague and broad, meant to describe the generalized processes of environment-behavior relations to plausibly account for observed patterns of behavior.
  Degree of certainty: Low-to-moderate. Typically describes how an operant/respondent approach might account for behavior. Might be the most plausible explanation among available alternatives, but certainty is unknown until tested directly.

Practical/experimental
  Description: Explicit description of environmental variables assumed to play a causal role in behavior occurring, and exactly how manipulating those variables will lead to behavior change.
  Degree of precision: Moderate-to-high. Typically involves visual display of how an environmental manipulation leads to an increase, decrease, or maintenance of responding. Quantitative descriptions make the behavior-environment relations direct and explicit.
  Degree of certainty: Moderate-to-high. Certainty increases with more replications of similar direction and degree of differences in behavior between the experimental and "control" conditions.

Analytic/statistical
  Description: Explicit description of potential confounding variables and predictions around the influence those variables have on behavior. Used with variables that cannot be manipulated empirically.
  Degree of precision: Moderate-to-high. Typically involves mathematically precise claims about the size and direction of influence that an uncontrolled variable might have on responding.
  Degree of certainty: Low-to-high. Certainty increases with (1) more data; (2) more accurate data; (3) greater coverage of the range the uncontrolled variables can take; and (4) uncontrolled variables having a logically known or plausible functional relation with behavior.

The first use is within a larger philosophy of science context (top row, Table 6.1). To theoretically account for the control of behavior means that we can use past behavior analytic research to offer a plausible explanation for when we might observe an increase, decrease, or the maintenance of some pattern of responding. Here, we are talking about hypothetically controlling two broad things: (1) the presence of a response class and (2) the environmental conditions that theoretically are likely to influence that response class. This often makes theoretical descriptions vague and broad (as they are meant to be). And, though they might be the most plausible explanation among available alternatives, certainty around their accuracy is unknown until tested directly.

Theoretical extensions become practical when we get to the second use of the term control in behavior science: demonstrating the experimental control of a specific response topography (second row, Table 6.1). Note that we have now expanded the precision (and subsequent set) of things that a behavior analyst is controlling. Practically, the behavior analyst must have identified and controlled the environmental conditions suspected to influence responding (e.g., implemented a behavioral intervention with high fidelity), they must have observed a change in responding as predicted by those environmental conditions (e.g., a reduction in challenging behavior), and they must explicitly create a "control condition" to demonstrate that the pattern of behavior change does not occur in the absence of the independent variables (IVs) being tested (e.g., a baseline condition; a "no-treatment" control condition).

Experimental control is obviously the ideal level of control for making claims about causal relations between the environment and behavior. But, unless you have some magic sauce the rest of us do not, as mere humans we are unable to control everything in the universe when conducting an experiment. And, in applied settings, we have more limited control over variables that may influence behavior relative to laboratory settings. Nevertheless, many questions arise where we might have or are able to collect some data, we want or need an answer to inform a decision, and the experimental control of relevant variables is not practically or physically possible. For example, questions that fit this description might be: Is a behavior analyst better at developing programs for clients of a certain age and with a certain behavioral profile? How does the number of programs we are working on with Lingyun influence his rate of acquisition? Are we truly seeing more progress from clients at Clinic A compared to Clinic B, or do they have truly different types of clients? How do different components of our current staff training translate to on-the-job performance one week, one month, and one year after training? And how does that interact with who their supervisors are?
In the abovementioned situations, we, fortunately, don’t have to throw our hands up in ignorance, flip a coin, and let random chance determine how we proceed. Less often discussed (or used) in published articles is how behavior analysts might still get an answer to their questions when they have some data on a topic but are unable to control for all potential variables that might help answer our question. For example, variables with a known influence on human behavior and that cannot be manipulated or controlled experimentally include age, preexperimental or preintervention learning history, verbal repertoire, sociocultural background, and physiological states and conditions (e.g., comorbid disorders, currently sick). In these situations, analytical or statistical control is often the most effective (and only available) tool in behavior analysts’ tool belt. This chapter focuses on this last area.
How might we control for nonempirically controlled variables? At least two approaches can be used to handle nonempirically controlled (NEC) variables. One method is to simply ignore them. To the extent that none of those variables play a significant enough role in an experiment or intervention, ignoring NEC variables is unlikely to impact the patterns of environment-behavior relations you observe and the conclusions you can draw. However, ignoring potential confounding variables has the drawback that—without measuring or manipulating them directly—you can never be sure as to whether they did influence behavior outside your awareness. Further, if one of those NEC variables changes substantially at a future point in time, you likely cannot make predictions about how behavior will change. This may create significant limitations to the results of your experiment or lead to shifts in behavior that teachers, therapists, and parents were not expecting. A second method for handling confounding variables is to simply collect ongoing and accurate measurements of the naturally varying presence of potentially influential variables across an experiment or intervention. Over time, as the amount of data collected increases, the behavior analyst can use mathematical techniques to identify the likely size, direction, and pattern of the relationship between the potentially influential variables and the behavior of interest (bottom row, Table 6.1). It is this last method of controlling for potentially influential variables that is the focus of this chapter.
Summarizing our situation

Biological life involves an intricate and complex set of interactions between the totality of organisms embedded within a dynamically changing environment. Though behavior analysts do their best to control all factors that can plausibly influence a response class, experimental control is not always possible for every variable with a known (or probable) relation to behavior. But just because we cannot control for these variables experimentally does not mean we cannot measure, track, and analyze the extent to which they are likely to influence behavior. In this chapter, we discuss how behavior analysts can control (analytically) for these (experimentally) uncontrolled variables, even if doing so is less ideal than experimental control. To do this, we get to start talking a bit more seriously about models of behavior-environment relations. Such talk shouldn't be cause for worry because, similar to the term "statistics," you are likely already playing with models of behavior-environment relations even if you have not talked or thought about it in that way.
Models of behavior
At the most basic level, a model can be defined as a representation of a thing, physical structure, or physical process (Merriam-Webster, 2022) that is "smaller" than the thing it is meant to describe. Here, "smaller" might be along the lines of physical proportions (e.g., a model plane that won't quite get us to the Bahamas) or the inclusion of fewer details (e.g., the three-term contingency). Models can take on many different forms, and the behavior of modeling may have many different functions (for a full review of this topographical and functional landscape for models in behavior science). In behavior analysis, models are often used to efficiently communicate about known or likely behavior-environment relations without having to include data and descriptions about every detail of the environment antecedent and consequent to the behavior of interest for all observations we have ever made. Practicing behavior analysts already use models of behavior. For example, consider the statement, "If you haven't paid attention to Bamidele in a while, she is likely to start acting silly until you give her attention." This statement is a model because the speaker is describing a physical process (positive reinforcement) and the speaker is using
fewer details (21 words in total) than physically occurred on the numerous occasions where observations were made and the data were collected to derive that functional relation. More generally, the previous statement is one example of the model tacted as a three-term contingency (i.e., SD-B-SR) where a physical system is being described in fewer details than the totality of physical events it describes. Often, these models used and contacted by behavior analysts are textual-vocal or, perhaps, graphical models (Moxley, 1982).

You may recall from Chapter 1 that statistics are a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data. Models simply create a framework for analyzing, interpreting, and presenting aggregate data. Statistical models are, therefore, similar to textual-vocal models in that they both aggregate patterns of behavior-environment relations and allow the model user to understand how they might predict and control behavior by changing the environment. The difference is that one of those models uses words to aggregate those patterns (textual-vocal) and the other uses mathematics and equations (statistical).

Statistical models have several practical advantages over textual-vocal models (Dallery & Soto, 2013). These include greater precision in describing operant and respondent contingencies; more precise predictions about the relations between behavior and the environment; the ability to falsify claims about behavior-environment relations; a greater ability for researchers and practitioners to unify observations of diverse phenomena; and the capacity to include and analyze the influence of dozens or hundreds of variables on behavior. Stated differently, quantitative models of behavior-environment relations make more precise predictions about what might happen based on our assumptions of everything we believe is controlling behavior (and have data for). In turn, we can examine how close our predictions are, identify where we missed the mark, and improve our predictions and understanding about the variables controlling behavior. Similar to many topics in the book thus far, these models can come in many flavors.
Regression models

Beginning with the familiar
The generalized matching equation is one example of a popular statistical model of behavior-environment relations in behavior analysis
(Baum, 1974; McDowell, 1989). The prototypical matching law experiment involves exposing a rat or pigeon to dozens of experimental sessions under varying contingency arrangements (Fig. 6.1). In each session, the organism is placed in an operant chamber where two concurrent variable-interval (VI) schedules are in effect. For example, one lever or key might be on a VI-60-second schedule, whereas the second lever or key might be on a VI-300-second schedule.
Figure 6.1 Example datasets from a prototypical matching law experiment. Each panel shows the relative response rates from one concurrent schedule ratio. Note: each data point represents an aggregate of responding for that individual for that single session (i.e., it is a descriptive statistic).
In this example, the first lever or key allows the organism to contact five times the amount of reinforcement compared to the second lever or key¹. Stated succinctly, the left:right lever ratio of reinforcement is 5:1. The rat or pigeon then contacts many sessions with this concurrent schedule arrangement until responding is stable. Once stable, the ratio of left:right programmed reinforcement would change to, say, 10:1. The organism contacts this arrangement until responding is stable, the ratio is changed again to, say, 1:1 until responding is stable, then 1:5 until stable, and finally 1:10. At the end of all this, we can calculate the correct measure of central tendency (Chapter 3) to get a single data point that represents the rate of stable responding in that condition (marker at far right of each panel in Fig. 6.1). Note here that we are already at one level of aggregation per condition—statistics!

Fig. 6.2 shows how the aggregated data from such an experiment are often presented for analysis and interpretation. Here, the behavior analyst takes those calculated average rates of responding and then uses computer software (e.g., Excel, Prism, Python, R) to create a scatterplot of the data. Once created, the behavior analyst can add the "line of best fit" to those data². If you recall from the math classes you may have taken when you were younger, the equation for a line is:

$$y = mx + b \qquad (6.1)$$
Here, y are the values corresponding to the y-axis, x are the values corresponding to the x-axis, m is the slope (i.e., how much y changes for every one-unit change in x), and b is the y-intercept (i.e., the y-value when x = 0). Getting the line of best fit is useful because the generalized matching equation is a linear model (i.e., it uses the equation for a line—Eq. 6.1). The only difference is that we swap out the dependent variable (DV) y for the ratio of responding ($B_{left}/B_{right}$) and we swap out the IV x for the ratio of reinforcement ($S^{R+}_{left}/S^{R+}_{right}$).
¹ With a VI-60-second schedule the organism will likely contact 60 reinforcers per hour (1 reinforcer per minute). With the VI-300-second schedule (1 reinforcer every 5 minutes), the organism will likely contact 12 reinforcers per hour (60/5 = 12). Left key:right key = 60:12 = 5:1.
² For those unfamiliar with how to do this in your favorite data software program, a quick Google search of "How to add a linear regression line in [program]?" should get you what you need (while swapping out [program] for the program you use; e.g., Excel, Prism). For the generalized matching equation, specifically, see the excellent tutorial by Reed (2009).
Figure 6.2 The spreadsheet shows the average rates of stable responding aggregated across the five conditions in Fig. 6.1. The graph shows one common method to visually display those data: a scatterplot with the line of best fit determined via linear regression.
The resulting equation then becomes:

$$\frac{B_{left}}{B_{right}} = m\,\frac{S^{R+}_{left}}{S^{R+}_{right}} + b \qquad (6.2)$$
This makes m and b interpretable from a behavioral perspective. m (the slope) describes how sensitive the organism is to detecting changes in reinforcement ratios (as measured by response ratios)³. And, b describes a general preference for one of the responses not captured by the measured reinforcement schedules⁴.

³ Readers interested in what all goes into organismic sensitivity to changing reinforcement schedules might get started with Davison and Tustin (1978), Madden and Perone (2013), and McCarthy and Davison (1980). Then, let the wonderful citation train be your tour guide.
⁴ One final technical note for the generalized matching equation is that—to get the data to be well described by a straight line—researchers take the logarithm of the reinforcement ratios and the response ratios before fitting the regression line (see Baum, 1974 for why this is useful). For our example, we simply logged the axes for similar effect.
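For readers who want to try this outside of Excel or Prism, here is a minimal sketch (our illustration, not code from the matching law literature) of fitting the generalized matching equation in Python with scipy. Following footnote 4, the ratios are logged before the line is fit; all numbers are fabricated for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical stable-responding aggregates from five conditions
# (programmed reinforcement ratios of 5:1, 10:1, 1:1, 1:5, and 1:10).
sr_ratio = np.array([5, 10, 1, 1/5, 1/10])       # reinforcement ratios
b_ratio = np.array([4.5, 8.9, 1.1, 0.23, 0.12])  # response ratios

# Fit a line to the logged ratios: slope = sensitivity (m), intercept = bias (b).
fit = stats.linregress(np.log10(sr_ratio), np.log10(b_ratio))
print(f"m = {fit.slope:.2f}, b = {fit.intercept:.2f}, r^2 = {fit.rvalue**2:.2f}")
```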
Table 6.2 Types of models and their characteristics.

| # | Model type | Number of independent variables | Dependent variable data type | Example from behavior analysis |
|---|---|---|---|---|
| 1 | Univariable linear regression | 1 | Continuous | Generalized matching equation |
| 2 | Univariable nonlinear regression | 1 | Continuous | Discounting, demand |
| 3 | Multivariable linear regression | 2+ | Continuous | Concatenated matching law |
| 4 | Multivariable nonlinear regression | 2+ | Continuous | Additive and multiplicative discounting; MPR |
| 5 | Univariable ordinal regression | 1 | Ordinal | Open vs. closed economy to predict PSPA ranks; AUCORD |
| 6 | Multivariable ordinal regression | 2+ | Ordinal | Economy type and session setting on PSPA ranks |
| 7 | Univariable classification | 1 | Ordinal/nominal | Effect of newsletter on political action (Schroeder et al., 2004) |
| 8 | Multivariable classification | 2+ | Ordinal/nominal | Predictors of "yes" or "no" intervention effect using visual analysis (Kahng et al., 2010) |

MPR, mathematical principles of reinforcement (e.g., Killeen, 1994; Killeen & Sitomer, 2003); PSPA, paired stimulus preference assessment; AUCORD, area under the curve using delays or probabilities ordinally (e.g., Borges et al., 2016).
The spreadsheet in Fig. 6.2 shows the slope and y-intercept for the data in the figure. Inserting those values, the researcher could say, "For every one unit change in reinforcement ratio, we expect a 0.97 change in response ratios." More generally, we have used a mathematical model to account for (1) how changing schedules of concurrent reinforcement influence changes in behavior; (2) how sensitive this specific organism is to changing schedules of reinforcement; and (3) general preference for one response that we did not control for in our experiment.

Generalizing to a bigger picture
If you feel comfortable with what behavior analysts are doing when playing with the matching law (i.e., building a linear regression model interpretable relative to behavioral processes), you are in great shape to understand statistical models more generally. Recall that "statistics" just means the analysis, interpretation, and presentation of aggregate quantitative data. The top row in Table 6.2 shows the general features
of the type of statistical model that we call the generalized matching equation. In the generalized matching equation, we have one IV (i.e., reinforcer ratios where each plotted ratio is an aggregate from many sessions), we are predicting a continuous DV (i.e., response ratios where each plotted ratio is an aggregate from many sessions), and we use a line to describe the relationship between reinforcer ratios and response ratios. In the jargon of statistics, the generalized matching equation is a univariable linear regression model. "Univariable" because we have one IV, "linear" because we are using a line to describe the relationship between the IV and DV, and "regression" because the DV is continuous.

As described above, linear models are useful because they are easy to fit to our data and are also very easy to interpret. But life is complex and data often do not fall on simple lines. When the relationship between variables does not fall along a straight line, we call it (quite logically) a nonlinear relationship. And, when we are still using a single IV to describe or to make predictions about a single DV, we again use the univariable label. Thus, when we are interested in talking about a single IV describing a single continuous DV but via a nonlinear relationship, we call such statistical models univariable nonlinear regression models.

The second row in Table 6.2 highlights two nonlinear statistical models in behavior analysis: discounting (e.g., Rachlin, 2006) and demand (e.g., Hursh & Silberberg, 2008). In both models, there is one IV (delay, probability, effort, or social relatedness for discounting; price to contact a reinforcer for demand) and a continuous (or near continuous) DV (reinforcer value for discounting; reinforcers consumed for demand). Also for both models, the relationship between the two variables is nonlinear (Fig. 6.3). For discounting, the relationship between the IV and DV is often hyperbola-like (e.g., Rachlin, 2006; left panel Fig. 6.3). For demand, the relationship between the IV and DV is often exponential (e.g., Hursh & Silberberg, 2008; right panel Fig. 6.3)⁵,⁶.

⁵ For the adventurous, one equation that describes the nonlinear relationship between IV and DV in discounting is $V = A/(1 + \gamma X)$, where V represents reinforcer value (i.e., how much behavior the reinforcer will maintain), X is the value the IV takes (e.g., delay in days, odds against), and γ is a free parameter that describes how quickly the DV changes as the IV changes (i.e., how reinforcer value changes with delay, odds against, effort, or social relatedness).
⁶ For the adventurous, one equation that describes the nonlinear relationship between IV and DV in demand is $\log_{10} Q = \log_{10} Q_0 + k(e^{-\alpha Q_0 C} - 1)$, where Q represents reinforcers consumed, Q₀ represents the reinforcers consumed when the price is 0 (i.e., free), k is a constant that represents the range of consumption, C is reinforcer cost (i.e., the IV), and α describes how quickly the DV changes as the IV changes (i.e., how reinforcer consumption changes with cost).
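For readers who want to see what fitting one of these nonlinear models can look like in practice, here is a minimal sketch using the hyperbolic discounting equation from footnote 5 and scipy's curve_fit. The indifference points are fabricated for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

A = 100.0                                                      # larger-later amount
delays = np.array([0.02, 0.08, 0.25, 0.5, 1.0, 5.0, 25.0])     # delays (years)
values = np.array([95, 88, 70, 60, 45, 18, 5], dtype=float)    # indifference points

def hyperbolic(x, gamma):
    """V = A / (1 + gamma * X), per footnote 5."""
    return A / (1 + gamma * x)

# curve_fit searches for the gamma that minimizes squared residuals.
popt, _ = curve_fit(hyperbolic, delays, values, p0=[1.0])
print(f"estimated discounting parameter gamma = {popt[0]:.2f}")
```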
Figure 6.3 Two popular statistical models in behavior analysis that show nonlinear relationships between the independent variable and the dependent variable.
Bringing it all together in the succinct language of statistical jargon: discounting and demand are univariable nonlinear regression models.

The generalized matching equation, discounting equation, and demand equation are all examples of univariable statistical regression models. They have one IV that we are using to predict or describe changes in one DV. But most behavior is likely influenced and controlled by many different variables; that is, multiple control of behavior is likely the rule rather than the exception (Michael et al., 2011; Skinner, 1957). Statistical models of behavior-environment relations that describe how multiple variables influence behavior are called (quite logically) multivariable models. And, as before, we can use either a simple line to describe that relationship (linear model) or something other than a simple line (nonlinear model).

The third row in Table 6.2 highlights the features and one example of a multivariable linear regression model in behavior analysis: the concatenated matching law (e.g., Davison, 1982; Davison & McCarthy, 1988; Hunter & Davison, 1982). Recall that the generalized matching equation predicts that the ratio of rates of two behaviors is equal to the ratio of reinforcement rates contacted by each behavior, with a free parameter slope accounting for sensitivity to changes in reinforcement rates, and the free parameter y-intercept describing bias for one response or the other. But, we also know that behavior is influenced by the amount of reinforcer at each reinforcer delivery as well as the immediacy with which the organism contacts reinforcement after responding (e.g., Rodriguez & Logue, 1986; Vollmer & Bourret, 2000).
The concatenated matching law attempts to account for this by adding amount (A) and immediacy (I) into the statistical model:

$$\log\frac{B_{left}}{B_{right}} = m \log\frac{R_{left}}{R_{right}} + a \log\frac{A_{left}}{A_{right}} + i \log\frac{I_{left}}{I_{right}} + b \qquad (6.3)$$

Just as before, we have the rate of reinforcement (R), sensitivity to changing reinforcement rates (m), and bias (b) in the equation. We have also now added the ratio of amount (A), sensitivity to changing ratios of reinforcer amounts (a), ratio of reinforcer immediacy (I), and sensitivity to changing immediacy ratios (i). Multiple variables (i.e., rate, amount, and immediacy = multivariable) aggregated over many sessions (i.e., statistical) are being used to predict, via a linear equation, a continuous DV (i.e., ratio of behavior). Succinctly via statistical jargon: a multivariable linear regression model.

The fourth row in Table 6.2 shows the features and some examples of multivariable nonlinear regression models in behavior analysis. Two recently explored models within discounting sought to understand how the delay (one IV) and probability (a second IV) of an outcome combine to influence reinforcer value within discounting studies (i.e., multivariable; e.g., Cox & Dallery, 2016; Vanderveldt et al., 2015). And, just as with discounting previously, there is a nonlinear shape to the relation between reinforcer value (i.e., a continuous variable—regression) and delay and probability. We will leave it to the curious reader to check out those papers if interested in the model details. But there are three main points here. First, multivariable nonlinear regression models are doing the same thing as all the above regression models, just using two or more IVs and an assumed nonlinear relationship between the IVs and a continuous DV. Second, behavior analysts already use these models to quantitatively describe the multiple control of choice, as constrained by known behavioral processes. Third, similar to the regression models above, the estimated parameters associated with each IV give you a precise estimate of the degree of influence that variable has on behavior (i.e., it is behaviorally interpretable).
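Because Eq. (6.3) is linear in its logged ratios, fitting it amounts to ordinary multiple linear regression. Here is a minimal sketch with numpy on fabricated session aggregates; the parameter values used to generate the data are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
log_r = rng.uniform(-1, 1, n)   # log reinforcement-rate ratios
log_a = rng.uniform(-1, 1, n)   # log reinforcer-amount ratios
log_i = rng.uniform(-1, 1, n)   # log immediacy ratios

# Fabricated log response ratios generated from Eq. (6.3) plus noise.
log_b = 0.8*log_r + 0.5*log_a + 0.3*log_i + 0.1 + rng.normal(0, 0.05, n)

# Least squares recovers the sensitivities (m, a, i) and bias (b).
X = np.column_stack([log_r, log_a, log_i, np.ones(n)])
(m, a, i, b), *_ = np.linalg.lstsq(X, log_b, rcond=None)
print(f"m = {m:.2f}, a = {a:.2f}, i = {i:.2f}, bias b = {b:.2f}")
```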
Rows 5 and 6 of Table 6.2 help us round out the types of regression models behavior analysts are likely to use. As discussed in most chapters to this point, many different types of data exist and each data type should be handled differently based on the logic of numbers. Modeling is no different. Above we talked about regression modeling when our DV is a continuous data type. But sometimes the measure of behavior we want to model is ordinal. For example, item rankings from preference assessments are ordinal and we would need to use a different equation to model these data (e.g., Liu et al., 2017) compared to a continuous DV (e.g., rate of responding). Nevertheless, just like continuous data, ordinal regression models can be linear or nonlinear, and ordinal regression models can involve one predictor variable (univariable) or multiple predictor variables (multivariable). One final note here is that using regression models for ordinal data only makes sense if you have enough levels of your ordinal data that it starts to look and smell similar to continuous data (e.g., the DV can take dozens or hundreds of values). If your ordinal DV does not have enough levels to dress like a continuous variable, you can pivot to models designed for discrete nominal data: classification models.
Classification models
As was highlighted in previous chapters, much of the data that behavior analysts collect are not continuous in nature, and we might not be able to transform them to a continuous data type. In these instances, we build what are referred to as classification models. This makes logical sense if you step back and think about it. If we want to predict, say, your favorite color, we would be predicting a category or label or class of the total set of stimuli that humans tact as "colors." Examples from applied behavior analysis might include predicting whether a client needs comprehensive or focused intervention based on the totality of their presenting symptoms, classifying the function of a behavior based on all the environment-behavior data collected, predicting whether a stimulus will or will not function as a reinforcer (or punisher), classifying whether or not staff implemented a step of an intervention procedure correctly, predicting whether or not staff are likely to resign, or classifying whether or not a change in behavior occurred based on visual analysis.

Approaches to classification modeling can be discussed relative to at least two characteristics of the things we have collected data on. First, how many IVs are we using to describe or predict our DV? As with regression models, the same jargon is used to describe whether we are using one IV (univariable) or two or more IVs
(multivariable) to describe or predict their influence on a discrete DV. Second, is responding predicted to fall into one of two categories or into one of three or more categories? When our DV has only two (bi) categories or levels, it is called binary classification. When our DV has three or more categories or levels, it is referred to as multinomial classification ("multi" meaning many and nominal meaning "names"; multinomial = many names).

The seventh row of Table 6.2 highlights the characteristics of univariable classification models. Let's start simply with binary classification. The rightmost column of row 7 shows one example from the Journal of Applied Behavior Analysis where researchers were interested in examining the relative influence of one IV on a binary ordinal DV. Schroeder et al. (2004) evaluated whether sending business owners a newsletter twice a week (one IV) led the business owners to take political action on environmental issues ("yes, they took action" or "no, they did not"—binary classification). This is also an ordinal DV because "yes, they took action" represents more behavior than "no, they did not take action," but we do not know how much more behavior they took. So we have a univariable binary classification task in front of us.

Arguably, the most popular binary classification model is called a logistic regression model. We know, we know. After all the hoopla above around defining regression vs. classification, it is likely very confusing that the first classification model we talk about is labeled a "regression." But it makes some sense when we dig into it and see the output of the model. And it also sets us up for understanding what the many classification models you will likely use or encounter are doing. Fig. 6.4 shows some data and a visual for what we are doing with the logistic regression model. Our measure of behavior can be either a "0" (did not occur) or a "1" (did occur). Sticking to the univariable model, we also have one IV (e.g., newsletters per week, reinforcers delivered per hour). Because the DV can only take on a "0" or "1" value, a linear function would not make much sense as it could predict values less than "0" or greater than "1" (dashed line in Fig. 6.4). The potential predicted values in the middle are not as concerning; thanks to the magic of numbers, logistic regression handles them for us quite handily.
Figure 6.4 Visual of the logistic regression output.
The solid line with markers in Fig. 6.4 shows the output of a logistic regression model⁷. Because of the math behind logistic regression, the output values range only between "0" and "1" (solid line in Fig. 6.4). As can be seen in the plot, as the x-values decrease toward zero, the predicted y-values approach "0." And, as the x-values increase toward the maximum value our IV can take (i.e., 20 in this example), the predicted y-values approach "1." In turn, we can interpret the output as the probability that the observation will be a "1" based on the value of whatever is represented by our x-axis (i.e., the value of our IV)⁸. That's the magic: because we have constrained our output between 0.0 and 1.0, we can interpret the numerical output as a probability. So, as the output gets closer to 1.0, we can interpret it as a higher probability that we will observe whatever ordinal or nominal label we gave to "1." Translating to behavior analytic contexts, a "1" might mean "the individual emits the target behavior." So, increasing the amount of the IV increases the probability we observe the target behavior. Quite logical! And, because the output of our model can be interpreted as a probability, it can take on any real value between 0.0 and 1.0—a regression! Also, quite logical!

⁷ For the reader interested in greater mathematical technicalities, we have shifted our x-axis range for ease of demonstration in Fig. 6.4, as IV values for most behavior analytic interventions and experiments cannot be negative. Technically, the 0.5 output is centered at x = 0 for the generalized logistic regression case. And, software packages handle the transformation behind-the-scenes to make the output interpretable for you.
⁸ If it makes sense based on the behavior-environment relations you have measured, you can reverse this function so that probability approaches 1.0 as x approaches zero and probability approaches 0.0 as x approaches the max value it can take.
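Here is a minimal sketch of what the software is doing behind the scenes, using scikit-learn's logistic regression (our library choice; the point holds in any package). The data are fabricated so that the behavior becomes more likely as the IV increases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
iv = np.arange(0, 21).reshape(-1, 1)                             # e.g., reinforcers per hour
occurred = (iv.ravel() + rng.normal(0, 3, 21) > 10).astype(int)  # binary 0/1 DV

model = LogisticRegression().fit(iv, occurred)
probs = model.predict_proba(iv)[:, 1]   # P(DV = 1) for each IV value
print(np.round(probs, 2))               # s-shaped output bounded between 0 and 1
```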
So, logistic regression models (and most classification models) give us a probability that a "1" is observed. The next step is to turn that continuous probability value into a discrete prediction of a "1" or a "0." This is done quite simply by choosing a cutoff probability. Then, any predicted probability above that cutoff value would be classified as a "1" and any predicted probability below that cutoff would be classified as a "0." The default cutoff is often 0.5 for most statistical software programs and programming packages. If you're thinking, "That seems quite arbitrary," then you are 100% right! There is no inherent reason the cutoff point has to be 0.5. You could set it to 0.05, or 0.6, or 0.923487364587643756324. Pick your poison! In the following section, we talk a bit more about how you might change that cutoff value. The point here is that logistic regression uses our IV(s) to make a prediction between 0.0 and 1.0 that your observation falls in one of two categories. And, you can change the cutoff point that leads to discrete labels based on your data and what you're trying to model.

What happens if we have more than one category value that our DV can take? Well, if you followed what we are after with a binary classification task, then you are in great shape as we just do the exact same thing with a twist. And, that twist often comes in one of two flavors: cherry or purple-blast. Only kidding: they are one-vs.-rest or one-vs.-one. Fig. 6.5 shows what we are after with each of these approaches. In the one-vs.-rest approach, we create one binary classification model for each label that shows up in our dataset. So, if there are three values our DV takes in our dataset (e.g., classifying errors as indicative of molar win-stay, molecular win-stay, or position bias; see Grow et al., 2011), we create one classification model of whether the DV is molar win-stay or not, one of whether the DV is molecular win-stay or not, and one of whether the DV is position bias or not. The output of each of these models is the probability that the observation is that categorical value. And, to make a final choice, we can simply look at which of the three models has the highest probability, and that becomes our guess. One-vs.-one multinomial classification also leads us to create multiple models and pick the one with the highest probability.
Figure 6.5 High-level overview of what multinomial classification models are doing. Please check the online version to view the color image of the figure.
The difference here is that, rather than building one model for each discrete value our DV takes, we create a model for each pairwise combination of values that our DV takes (lower panel in Fig. 6.5). Using the molar win-stay, molecular win-stay, or position bias example from above, we would create one model each for molar win-stay vs. molecular win-stay, molar win-stay vs. position bias, and molecular win-stay vs. position bias. We can then count the number of pairwise combinations that each category "wins" and, if there's a tie, the category with the
higher probability in the head-to-head model is predicted as the "correct" category label. Easy as pie!

A question that logically follows the preceding two paragraphs is when we might want to choose the one-vs.-rest or the one-vs.-one approach to multinomial classification. The answer here really comes down to what your data look and feel like. One-vs.-rest is really nice to use when the measures of each DV value are fairly distinct when we plot them relative to interactions between our IVs (e.g., how they are plotted in Fig. 6.5); when we have pretty different amounts of each of the DV values in our dataset (aka an imbalanced dataset); or when there are a lot of values that our DV takes in our dataset. To drive that last point home, say there are 10 values that our DV takes. One-vs.-rest would require our computer to create 10 models, whereas one-vs.-one would require 45 models (remember from Chapter 2, $C(n, r) = \frac{n!}{r!(n - r)!}$). This can become computationally time consuming if the number of possible values gets large enough. One-vs.-one is great when there is quite a bit of overlap in our DVs when plotted relative to IVs; when we have a similar amount of each of the DV values in our dataset (aka a balanced dataset); and when there are few values the DV takes.

To close out this section, it is important to note that we briefly discussed logistic regression as one approach to modeling discrete DVs that are either binary or multinomial. As with much of what we have covered throughout the book, this is only one of the many possible approaches to building classification models that exist in the wild. To help build some intuition around what's possible, consider again Fig. 6.5. In each model, we are essentially trying to figure out the best way to draw a boundary that separates the two or more categories we are interested in differentiating. As you can likely imagine, there are thousands, maybe millions of different ways we can draw the dashed separation boundary. And we don't always have to use a line; we could draw a nonlinear "squiggly" boundary, or we could draw a circle around one dataset, or an oval, or a triangle, and so forth. The limit to how you think about your data really comes down to your own creativity and ability to implement your idea in whatever software program you have access to.

Hopefully, by this point in the chapter, you have a good sense of the basics behind how to model discrete ordinal or nominal data. If your
DV is multinomial, choose either the one-vs.-rest or the one-vs.-one strategy based on the characteristics of your data. Convert your data to binary format where the “1” and “0” make logical sense based on your DV. Choose the model you want to use (e.g., logistic regression) from the options available in the software package you prefer. Pour yourself a cup of coffee, mineral water, or your favorite beverage. And, let the software do the heavy lifting to fit the model and spit out an output.
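As a minimal sketch of the one-vs.-rest recipe just described, using scikit-learn and the three error types from the Grow et al. example (the data, group means, and numbers are all fabricated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Three fabricated clusters, one per error type; two IVs per observation.
means = {"molar win-stay": (0.0, 0.0),
         "molecular win-stay": (3.0, 3.0),
         "position bias": (-3.0, 3.0)}
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(30, 2)) for m in means.values()])
y = np.repeat(list(means), 30)

# One binary "this label vs. the rest" model per label.
probs = {label: LogisticRegression().fit(X, (y == label).astype(int))
                                    .predict_proba(X)[:, 1]
         for label in means}

# The final guess is the label whose model assigns the highest probability.
stacked = np.column_stack([probs[label] for label in means])
predicted = np.array(list(means))[stacked.argmax(axis=1)]
print(predicted[:3])
```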
Brief primer on interpreting models
To close out this section, we want to first say that the process of building accurate and reliable statistical models truly deserves a book-long treatment in itself⁹. So, though we believe that everyone should experiment and "try this at home," be sure you run your final models and outputs past someone with some experience in this area before you go off publishing and making claims that are…not quite right. To give you some headway in this direction, we'll close out this chapter by briefly reviewing what to do with the output that your statistical software spit at you while you were away pouring a cup of Joe. This comes in two phases. First, we can ask how well our model describes our data or makes predictions. Second, if our model performs well, we can ask how much each IV influences our DV.
How good is this model, exactly?
The first step to evaluating how good our model is involves analyzing loss/fit¹⁰ metric(s). A loss metric is pretty much exactly as it sounds. We have some kind of metric—a measure—of the loss of information that comes by way of using a model for description and prediction as opposed to using the raw data. Stated in a more long-winded fashion, the most accurate description of our data is the raw data itself because it is the exact thing we want to describe and predict. But using only raw data comes with all the baggage and imprecision we have discussed to this point in the book. When we create a model of our data, it is unlikely to perfectly fit the raw observations we have made. It'll be "off" in some way, shape, or form. As a result, we lose information that was contained in the raw data, with the trade-off being increased precision in both the predictions we can make and the estimates of the influence of our IVs on behavior.

⁹ Psst…hey, Jason! What are you doing next year?!?!
¹⁰ For the modeling pros out there, we appreciate that some people distinguish loss and fit metrics. However, some of the arguments put forth on this distinction are not always convincing and often feel like splitting hairs, so we decided to keep things simple for the sake of an introduction to this material.
Figure 6.6 Plots showing how we derive loss metrics for regression analyses via an example of delay discounting.
Fig. 6.6 shows a visual of what we mean here via an example of delay discounting. The leftmost panel shows a prototypical discounting dataset where we have the measured indifference points for a human, rat, or pigeon participant. The middle panel of Fig. 6.6 zooms in to look at just the first five indifference points, up to a 0.50-year delay to the larger later reinforcer. Here, it is easier to see how "off" our model is—the gap or the difference—between the observations we made and what the model is predicting (denoted by the arrows). Formally, these differences are called residuals, and we can calculate them by subtracting the model's predicted value from the observed value. Once we calculate these residuals for each observation in our dataset, we can measure how much information we lost by using our model instead of the raw data: a loss metric.

The far right panel of Fig. 6.6 shows what is called a residual plot. A residual plot graphs the calculated residuals (y-axis) as a function of our IV which, in this case, is the delay to the larger later reinforcer. Residual plots should look like the one shown in Fig. 6.6: approximately evenly distributed around 0.0 with no obvious trends in the data. That is, the residuals from fitting our model should not systematically become more negative or more positive as our IV increases. If we do observe systematic changes in our residuals, then we likely need to improve the model before moving forward. Assuming our residuals are evenly distributed and nonsystematic, then we can move into more formal calculations using our residuals.
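Residuals themselves take one line to compute. A minimal sketch with fabricated observed and model-predicted values:

```python
import numpy as np

# Fabricated observed indifference points and model predictions.
observed = np.array([95.0, 88.0, 70.0, 60.0, 45.0])
predicted = np.array([93.2, 89.1, 72.6, 57.8, 46.5])

residuals = observed - predicted   # observed minus predicted, per observation
print(residuals)                   # should hover around 0 with no trend
```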
How well does my regression model fit my data?
Table 6.3 shows six commonly used loss/fit metrics for regression models. Most statistical software programming packages will calculate these for you.
Table 6.3 Common loss metrics for regression tasks.

| Name | Steps to calculate | Common equation | Ideal use case |
|---|---|---|---|
| r² (aka VAC) | (1) Calculate the residual for each observation and square it: (y − y′)². (2) Sum the values from step 1. (3) Find each observation's difference from the mean and square it: (y − ȳ)². (4) Sum the values from step 3. (5) Divide step 2 by step 4. (6) Subtract step 5 from 1. | $1 - \frac{\sum (y - y')^2}{\sum (y - \bar{y})^2}$ | Linear regression only. Interested in how the model compares to using the mean to describe your data. |
| MAE | (1) Calculate the absolute value of the residual for each observation: abs(y − y′). (2) Sum the values from step 1. (3) Divide step 2 by the total number of observations. | $\frac{\sum \lvert y - y' \rvert}{N}$ | Linear or nonlinear regression. Interested in how "off" your model is, with the metric in units of your DV. |
| MSE | (1) Calculate the residual for each observation and square it: (y − y′)². (2) Sum the values from step 1. (3) Divide step 2 by the total number of observations. | $\frac{\sum (y - y')^2}{N}$ | Linear or nonlinear regression. Interested in how "off" your model is, with greater weight for higher residuals. |
| RMSE | Steps (1)–(3) as for MSE, then (4) take the square root of step 3. | $\sqrt{\frac{\sum (y - y')^2}{N}}$ | Linear or nonlinear regression. Interested in how "off" your model is, with greater weight for higher residuals and the metric in units of your DV. |
| AIC | Steps (1)–(3) as for MSE. (4) Take the log of step 3 and multiply by the number of observations. (5) Multiply the number of free parameters (K) in your model by 2. (6) Add steps 4 and 5. | $N \log(\hat{\sigma}^2) + 2K$ | Model comparisons. Use this equation only for least squares regression with normally distributed residuals; else, the log-likelihood equation is appropriate. With fewer than 40 total observations, AICc should be used. |
| BIC | Steps (1)–(4) as for AIC. (5) Count the number of free parameters (K) in your model. (6) Multiply step 5 by the log of the total number of observations. (7) Add steps 4 and 6. | $N \log(\hat{\sigma}^2) + K \log(N)$ | Model comparisons. You want to penalize the number of parameters to a greater extent. Use this equation only for least squares regression with normally distributed residuals; else, the log-likelihood equation is appropriate. |

AIC, Akaike's information criterion; BIC, Bayesian information criterion; DV, dependent variable; MAE, mean absolute error; MSE, mean squared error; RMSE, root mean squared error; VAC, variance accounted for.
However, for the Excel die-hards, the second column outlines the steps to calculate each metric, along with the notation for each step. For the mathematically inquisitive, the third column shows how those steps combine into the overall equation. Lastly, for everyone reading this book, the last column shows the typical condition under which each loss metric is likely the best metric for the job. Importantly, note that every loss metric begins by calculating the difference between what our model predicts and the actual data: the residuals. And, again, each assumes the residuals are evenly distributed around zero and nonsystematic.

Variance accounted for (r²)
The top row of Table 6.3 shows arguably the most common fit metric used to evaluate models in the history of behavior analysis: r², also known as variance accounted for (VAC). Intuitively, r² is a comparison of two descriptive models. One model is the model we built. How well that model describes our data is the top part of the equation. The second model is arguably the simplest description of our data that we can use: the arithmetic mean of our observations. How well the arithmetic mean describes our data is the bottom part of the equation. To compare the two models, we simply take the ratio of how well our model describes the data compared to how well the arithmetic mean describes the data. More technically, we divide the sum of the squared residuals from our model by the sum of the squared residuals from the arithmetic mean¹¹. And we subtract that ratio from 1.0. Models with r² close to 1.0 describe the data really well. Models with r² values close to 0.0 are just as good as using the arithmetic mean to describe our data (i.e., not a very good model). And, models with r² values less than 0.0 are…well…downright terrible. For even more context, good models derived from highly controlled experimental settings in behavior analysis typically have r² values greater than 0.85 or 0.90. And, good models from more complex or applied contexts have r² values greater than 0.60 or 0.65.

VAC is so popular that we are going to give it an extra paragraph in this section around its ideal use case. The reason is that r² is an appropriate fit metric only for linear regression models and is
¹¹ Technically, we also square the residuals before comparing them. See Chapter 4 for a discussion on why. But the high-level idea is the same: we compare the squared model residuals to the squared residuals from the mean by taking the ratio between the two.
not an appropriate fit metric for nonlinear models (e.g., discounting, demand). The mathematical reasons for this are outside the scope of this book. But, interested readers should check out Spiess and Neumeyer (2010) for a description of why r² has historically been inappropriately used across sciences and a demonstration of why other metrics are better for nonlinear regression models. In the end, r² is a relatively intuitive, widely used fit metric that is best suited for linear regression models.

Mean absolute error
When we have nonlinear regression models, one sensible metric of model performance is to measure how "off" our description or prediction is from our observations on average. That's what mean absolute error (MAE) does. Just as with all regression loss metrics, we start by calculating the residuals of our model. The "absolute" bit comes from then taking the absolute value of the residuals (i.e., making them all have a positive sign). Lastly, we calculate the arithmetic mean of those absolute residuals. The really nice thing about MAE is that the metric remains in the unit of our DV. For example, if our DV is responses per minute, and the MAE for our model is 1.21, then our model is off in predicting behavior by 1.21 responses per minute on average. Whether this is good or bad likely depends on the social significance of being off by 1.21 responses per minute. Regardless, if you want an easy-to-interpret loss metric in the units of your DV and have a nonlinear model, MAE is likely a good choice.

Mean squared error
Mean squared error (MSE) is another common metric used for nonlinear regression models. Just as with r² and MAE, we start by calculating the residuals for each observation. Next, we square each of the residuals. The idea here is to penalize residuals that are further away from the observed values compared to the residuals that are closer to the observed values. This happens because, as residuals get larger and larger, squaring them increases their impact on our loss metric disproportionately. MSE is great when the MAE for several models is relatively similar, as MSE can magnify subtle differences in the residual distributions to highlight models that have more relatively large residuals. The downside is that the loss metric is no longer in the same units as our DV, making interpretation less straightforward than with MAE.
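Once the residuals are in hand, each of these metrics takes one line of numpy. A minimal sketch with fabricated values (r² is included only because the example is convenient; per the caveat above, reserve it for linear models):

```python
import numpy as np

observed = np.array([95.0, 88.0, 70.0, 60.0, 45.0, 18.0, 5.0])
predicted = np.array([93.2, 89.1, 72.6, 57.8, 46.5, 16.3, 7.1])
residuals = observed - predicted

mae = np.mean(np.abs(residuals))    # in units of the DV
mse = np.mean(residuals ** 2)       # weights large residuals more heavily
rmse = np.sqrt(mse)                 # back in units of the DV
r2 = 1 - np.sum(residuals ** 2) / np.sum((observed - observed.mean()) ** 2)
print(f"MAE={mae:.2f}, MSE={mse:.2f}, RMSE={rmse:.2f}, r2={r2:.3f}")
```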
Root mean squared error
Root mean squared error (RMSE) combines some of the benefits of MSE and MAE. RMSE is calculated in the exact same way as MSE. The only difference is that, once we calculate the arithmetic mean of the squared residuals, we then take the square root of that average. RMSE, therefore, has the benefit of penalizing larger residuals more than MAE does while also converting the final loss metric into the units of our DV for more straightforward interpretation. Thus RMSE is a great choice if you find yourself in a nonlinear regression modeling situation and you have reason to penalize models that make relatively large errors while maintaining the unit of your DV for interpretability.

Metrics that penalize model complexity
A limitation of MAE, MSE, and RMSE is that they provide no information about model complexity. As with many sciences, behavior analysts often prefer simple descriptions and accounts of behavior over complex ones (Occam's razor). Model complexity is typically quantified by the number of free parameters in a model. Failing to account for the number of free parameters is a limitation because models with more free parameters typically describe and predict behavior better simply because they are more flexible, not because they make more theoretical sense. Akaike's information criterion (AIC) and Bayesian information criterion (BIC) are two loss metrics that evaluate models on how well they describe/predict behavior as well as on the number of free parameters the model has.

AIC and BIC are calculated in very similar ways¹². Both begin with the MSE (i.e., calculate residuals, square them, add them up, and divide by the number of observations). From there, both also then take the log of this value and multiply the result by the total number of observations. This left half of each equation is the portion that accounts for how well the model describes/predicts our data. The right half of each equation is the portion that penalizes each model based on the number of parameters (K) it has. The primary difference is that AIC multiplies the number of parameters by two, whereas BIC multiplies the number of parameters by the log of the number of observations. If you started to do the math in your head, yes—when there are 100 observations, these
two equations are equivalent. But, given we often have more than 100 total data points (see Chapter 7 for a more detailed discussion of what counts as an observation), BIC typically is considered to penalize the number of parameters more than AIC.

¹² Readers interested in the foundations underlying these equations should refer to Burnham and Anderson (2016).
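Following the Table 6.3 equations, AIC and BIC reduce to a short helper. One caution about an assumption in this sketch: we use log base 10, which is what the "equivalent at 100 observations" statement above implies, but many software packages use the natural log instead, so check your package's convention before comparing values across tools.

```python
import numpy as np

def aic_bic(residuals, k):
    """AIC and BIC per Table 6.3 for a least-squares model with k free parameters."""
    res = np.asarray(residuals, dtype=float)
    n = len(res)
    fit_term = n * np.log10(np.mean(res ** 2))   # how well the model fits
    return fit_term + 2 * k, fit_term + k * np.log10(n)

print(aic_bic([1.8, -1.1, 2.6, -2.2, 1.5, -1.7, 2.1], k=1))  # lower is better
```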
How well does my classification model fit my data?
All classification loss metrics center around whether our model did or did not predict the correct discrete category for each observation. To make this easy, we will use an example where our DV can take one of two values, "yes, behavior occurred" = "1" or "no, behavior did not occur" = "0." Here, "1" might be a correct response during a match-to-sample task, a mand emitted, aggression occurred, and so on. In this situation, there are two possible values that our behavior can take (1 or 0) and two possible values that our model can predict (1 or 0). Consequently, Table 6.4 shows the four possible outcomes that can occur with these two combinations. Our model might predict behavior occurs and behavior actually occurred (true positive); predict behavior occurs and behavior actually did not occur (false positive); predict behavior does not occur but behavior did occur (false negative); or predict behavior does not occur and behavior did not occur (true negative). As you probably can guess, we want the predictions from our model to land in the upper left and lower right "true" buckets. And, we want very few observations to fall into the lower left and upper right "false" buckets. There are many different ways that we can quantify the relative distribution of the predictions from our model across these four buckets. Table 6.5 shows commonly used loss metrics when we have a binary DV.

Table 6.4 Example confusion matrix when our dependent variable is binary (i.e., is one of two values).

| | Actual positive (1) | Actual negative (0) |
|---|---|---|
| Predicted positive (1) | True positive (TP) | False positive (FP) |
| Predicted negative (0) | False negative (FN) | True negative (TN) |
Table 6.5 Common loss metrics for classification tasks.

| Name | Common equation | Ideal use case |
|---|---|---|
| Accuracy | $\frac{TP + TN}{TP + TN + FP + FN}$ | Balanced count of all category labels. All categories are equally important but not the specifics about any single category. |
| Precision (aka positive predictive value) | $\frac{TP}{TP + FP}$ | It is most important to get most positive predictions right, and false positives are costly. |
| Negative predictive value | $\frac{TN}{TN + FN}$ | It is most important to get most negative predictions right, and false negatives are costly. |
| Recall (aka sensitivity aka true positive rate) | $\frac{TP}{TP + FN}$ | It is most important that you are capturing all positive instances and it is costly to miss one. |
| Specificity (aka true negative rate) | $\frac{TN}{TN + FP}$ | It is most important that you are capturing all negative instances and it is costly to miss one (e.g., it would be costly for someone to not be prepared for the occurrence of behavior). |
| F1 (generalizes to Fβ) | $(1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$ | Detecting the occurrence of the phenomenon is most important and both precision and recall are equally important. |
| False positive rate (aka Type I error) | $\frac{FP}{FP + TN}$ | The cost of raising a false alarm is very high and you want to avoid it. |
| False negative rate (aka Type II error) | $\frac{FN}{TP + FN}$ | The cost of failing to detect the occurrence of the phenomenon is high. |
| Matthew's correlation coefficient | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ | Imbalanced count across category labels. All classes equally important but not the specifics about any single category. |

FN, false negative; FP, false positive; TN, true negative; TP, true positive.
Each metric has its strengths and weaknesses, with researchers and practitioners typically using more than one metric to quantify how well a classification model performs. Lastly, as described in the section titled "Classification models," extending these to situations where our DV has three or more potential labels is easy. You just create a confusion matrix modified for a one-vs.-rest analysis and then calculate the metric; or you create a confusion matrix for each one-vs.-one analysis, calculate your metric for each, and analyze the resulting set of loss metrics.

Accuracy
Accuracy is likely the classification loss metric most familiar to behavior analysts. Accuracy is calculated by simply counting up the number of times the model was "right" (true positives and true negatives) and dividing by the total number of predictions. If multiplied by 100, accuracy is simply the percentage of predictions the model got right.
Accuracy is only a good metric when two conditions are met. The first condition is that you have a similar amount of both positive and negative category labels. To see why, consider a situation where the behavior occurs 99 out of 100 times. A very simple and uninsightful model would be to always predict the behavior will occur. In this situation, the model would have an accuracy of 99%, but it would be hard to argue the model is telling us anything useful. The second condition wherein accuracy is useful is when we do not care about what kind of correct or incorrect predictions our model makes. As described above with Table 6.4, our model can make correct and incorrect predictions in two ways each. Accuracy does not differentiate the two correct predictions from each other, nor the two incorrect predictions from each other. To the extent that you have imbalanced category counts or that it matters that you can differentiate correct/incorrect categories, accuracy is not the best metric to use.

Predictive value
Predictive values are a second set of metrics often used to quantify how well classification models make predictions (rows 2 and 3; Table 6.5). As the names imply, positive predictive value (aka precision) focuses on how well our model predicts the positive values (the "1"s) and negative predictive value focuses on how well our model predicts the negative values (the "0"s). In Table 6.4, each predictive value metric focuses on one of the rows in the confusion matrix. Using the example above where behavior occurs on 99/100 trials, if it is really important that our model predicts the one instance where behavior does not occur, then negative predictive value captures that. Here, always predicting that behavior will occur would lead to a 0% negative predictive value, as the model misses the one opportunity it had (i.e., 0/[1 + 0]). This obviously isn't the entire story, as the model does well with positive predictions, coming in at a 99% positive predictive value. As a result, most researchers using predictive value metrics would report both positive and negative predictive values rather than either in isolation. And, if you have a model with a really high positive/negative predictive value but not the other, then you know the model needs some improvement.
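To make the 99-out-of-100 example concrete, here is a minimal sketch of accuracy and both predictive values for a model that always predicts behavior will occur (the counts are fabricated to match the example):

```python
tp, fp, fn, tn = 99, 1, 0, 0   # always predict "1": 99 true positives, 1 false positive

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.99, yet the model is uninformative
ppv = tp / (tp + fp)                         # 0.99 positive predictive value
npv = tn / (tn + fn) if (tn + fn) else 0.0   # 0.0: no negative prediction ever lands
print(f"accuracy={accuracy:.2f}, PPV={ppv:.2f}, NPV={npv:.2f}")
```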
True rates
True rates are often used to quantify how well classification models capture the true rates at which positive and negative labels occur (rows 4 and 5; Table 6.5). As the names imply, true positive rate (aka recall aka sensitivity) focuses on the extent to which our model captures all positive instances that occurred, and true negative rate (aka specificity) focuses on the extent to which our model captures all negative instances that occurred. In Table 6.4, each of these focuses on one of the columns in the confusion matrix. These metrics are typically used when it would be a big deal to miss either a positive occurrence (e.g., a diagnostic test failing to detect cancer when it is present; a Registered Behavior Technician [RBT] failing to document that abuse was observed) or a negative occurrence (e.g., failing to rule out disease with an expensive and intensive treatment regimen; failing to rule out a behavioral function, leading to wasted treatment resources). Similar to predictive values, true rates are often reported in conjunction with one another rather than in isolation.

F1 (Fβ)
Positive predictive value (precision) focuses on the top row of Table 6.4 and true positive rate (recall) focuses on the left column of Table 6.4. Given both of these focus on different aspects of positive occurrences, researchers often want to quantify how well classification models balance both predictive values and true positive rates. Fβ scores do this by combining positive predictive value and true positive rate into a single, balanced metric. The most common way to do this is to give equal weight to each by setting β = 1. When β = 1 it is called the F1 score. However, β is a continuous number and, technically, you could swap it out for whatever positive real number you'd like (i.e., any value between 0 and infinity). β values greater than 1 give more weight to true positive rates (recall) and values between 0.0 and 1.0 give more weight to positive predictive value (precision). Common versions of Fβ are F2, which gives twice as much weight to true positive rates, and F0.5, which gives twice as much weight to positive predictive value. Lastly, if the absence of the phenomenon were more important to your research or practice context, you could calculate Fβ scores using negative predictive value and true negative rates.

False rates
As the name implies, these metrics tell you how often your model makes an error out of all the opportunities it had to make an error. If you have taken any kind of statistics class in your life, you likely have heard of Type I and Type II errors. Those are just other names for false rates. Type I errors (aka false positive rate) quantify how often
False rates
As the name implies, these metrics tell you how often your model makes an error out of all the opportunities it had to make that kind of error. If you have taken any kind of statistics class in your life, you have likely heard of Type I and Type II errors. Those are just other names for false rates. Type I errors (aka false positive rate) quantify how often our model incorrectly predicts that behavior would occur relative to all opportunities wherein it could have made this kind of error. Type II errors (aka false negative rate) quantify how often our model incorrectly predicts that behavior would not occur relative to all opportunities wherein it could have made this kind of error. False positive and false negative rates are typically used when the cost of making an error is much greater than the benefit of getting predictions right. This is why Type I and Type II errors are often used when analyzing experimental results. To err on the side of caution, scientists want to make sure the likelihood is low that they make an error (i.e., claim an effect exists when it actually may not). For classification modeling, specifically, researchers will often use false positive and false negative rates in conjunction with either positive/negative predictive values or true positive/negative rates. This allows scientists to get a balanced look at how well a model is capturing all four possible model outcomes.

Matthews correlation coefficient
The Matthews correlation coefficient (MCC) is a great single metric, and an alternative to accuracy, for looking holistically at the performance of classification models. Though the mathematical reasoning behind MCC is well beyond the scope of this book, MCC has been shown to be quite a robust metric. Specifically, MCC is a more reliable loss metric than accuracy and the F1 score; MCC is high only when predictions are good for all four categories in the confusion matrix; and MCC performs well with imbalanced datasets (Chicco & Jurman, 2020). We saved it for last so that you would have a good sense of the limitations of accuracy, you would understand the benefit of focusing on individual categories in the confusion matrix, and you would know what an F1 score is and its limitations. There is still significant benefit to reporting multiple metrics, as is common with many of the metrics above. However, if you want a single metric that quantifies how well your model performs while balancing all four categories, MCC is likely the best first choice.
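Should you want to compute MCC yourself, the standard formula from confusion matrix counts is sketched below (our code; in practice, scikit-learn's matthews_corrcoef computes it directly from label vectors).

```python
# A sketch (ours) of MCC from confusion matrix counts. In practice,
# sklearn.metrics.matthews_corrcoef does this from label vectors.
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; ranges from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# The always-predict-occurrence model from the accuracy example:
print(mcc(tp=99, tn=0, fp=1, fn=0))  # 0.0 despite 99% accuracy
```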
Loss metric summary
To reorient ourselves in the landscape of this chapter: we start by identifying what kind of model is best suited to our IV-DV context. From there, we let computers do their thing and fit the model to our data so that we can describe the data or make predictions about future data. Next, we examine how good our model actually is by converting our residuals or confusion matrix counts into a loss metric, where the chosen metric is determined practically by the DV and how we want to communicate about it afterward. The value of the loss metric allows us to determine whether we can move to the final stage of model evaluation. What counts as a “good” model depends on the loss metric and model type. For linear regression tasks, r² values greater than 0.85 (experimental contexts) or 0.60 (applied/complex contexts) are typically considered “good.” For nonlinear regression models, MAE and RMSE are often used because they keep the metric in the units of our DV. Here, “good” is determined pragmatically. For regression model comparisons, AIC and BIC are better choices because they evaluate model complexity and model fit simultaneously. AIC and BIC, however, are relative measures (i.e., they can’t really be interpreted in isolation), with lower values meaning better models. Lastly, a suite of loss metrics exists for classification models. Scientists will typically pick a pair of metrics based on what is most important to their context. These include the model’s predictive capabilities (predictive values), how well it captures the true rates of each category, and the types of errors the model makes. Sometimes researchers want a single metric that blends a pair of metrics (Fβ) or captures all four correct/incorrect possibilities in a single number (accuracy, MCC). Though accuracy is likely the most familiar to behavior analysts, it is an appropriate metric in only specific circumstances, and past research suggests MCC is the best single metric for balancing all four correct/incorrect possibilities.
How influential is each independent variable on my dependent variable?
If you have a good model fit, congratulations! You have successfully taken a suite of measured IVs, some likely controlled and some likely not, and you have found a way to quantify the precise relationship between those IVs and your DV within the environmental contexts in which they occurred. This is no easy feat! You now get to enjoy what this entire enterprise is really about: learning about the relative influence of each of those IVs on your DV. You can do this by evaluating the values that were assigned to each free parameter in your model. And, likely unsurprising to you at this point in the chapter, what exactly those free parameter values mean depends on the type of model we used.

Refresher on free parameters
A free parameter is any variable in our model that we did not measure directly and for which we did not have a specific set of values in our dataset. For example, consider the classic equation for delay discounting:

V = A / (1 + kD)    (6.4)
In this nonlinear regression model, V is the value[13] of a larger later reinforcer at the specific delay of D; A is the measured amount of the reinforcer (e.g., $100, 10 minutes of iPad, 10 pieces of candy, 10 seconds of tickles); and k is the rate at which value decreases as a function of increasing delay. In delay discounting experiments, we control the amount of the reinforcer (A) and the delay (D) to when the participant would contact A. We also directly measure the choices that participants make (our DV), which equate to the value (V) of the reinforcer amount A at delay D (i.e., the larger later reinforcer). Thus V, A, and D are all “fixed” values in our dataset—we cannot change those measurements when we fit the model, as the data have already been collected. When we fit Eq. 6.4 to our data, the only variable that is “free to vary” per participant is k. Thus k is a free parameter. And, a participant’s k value tells us the influence of increasing delay (our IV) on reinforcer value (our DV) for that participant.

One final note on terms. Different areas of science sometimes use different terms to refer to the estimated values for each free parameter. Names you are likely to see in the wild include model coefficients, variable coefficients, coefficients, beta weights, variable weights, feature weights, feature importance, and variable importance. At a high level, they all mean essentially the same thing: the estimated value of a free parameter in our model that is meant to capture the relative influence of that IV on the DV.

[13] By value, we just mean the amount of behavior that the reinforcer is likely to maintain. Typically “behavior” is made even more specific to the context, such as the duration, intensity, rate, latency, or relative allocation.
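To make the idea of “free to vary” concrete, here is a minimal sketch (ours, not from the text) of estimating k in Eq. 6.4 with SciPy; the delays and obtained values are hypothetical.

```python
# A minimal sketch (ours) of estimating the free parameter k in Eq. 6.4.
# Delays and obtained values are hypothetical; values are expressed as
# proportions of the reinforcer amount (A = 1.0).
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(delay, k, amount=1.0):
    """Eq. 6.4: V = A / (1 + kD)."""
    return amount / (1 + k * delay)

delays = np.array([1, 7, 30, 90, 365])             # days (fixed IV values)
values = np.array([0.95, 0.80, 0.52, 0.30, 0.10])  # measured DV (hypothetical)

(k_hat,), _ = curve_fit(hyperbolic, delays, values, p0=[0.01])
print(f"Estimated k = {k_hat:.4f}")  # k is the only value "free to vary"
```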
Interpreting linear regression parameters
Interpreting linear regression model coefficients is the easiest of all the model types. Quite simply, an estimated free parameter in a linear regression model describes how much your DV will change for each one unit increase in the IV. If the coefficient is positive, then every one unit increase in your IV leads to an increase in your DV by the amount of the coefficient. And, if the coefficient is negative, then every one unit increase in your IV leads to a decrease in your DV by the amount of the coefficient. When you have more than one IV in your model, each coefficient represents the isolated influence of that IV on your DV (i.e., holding all other IVs constant). Importantly, most statistical outputs will provide you with a confidence interval for the estimated value of the free parameter and a p value. If the confidence interval contains 0.00, or the p value is greater than 0.05, it is unlikely that your IV has a consistent influence on the DV for that participant.
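As a concrete illustration of reading these outputs, the sketch below (ours) fits a simple linear regression with the statsmodels library; the sessions-vs.-rate data are invented for the example.

```python
# A sketch (ours) of fitting a linear regression and reading off the
# coefficient, its confidence interval, and its p value with statsmodels.
# The data are hypothetical: sessions of intervention (IV) vs. responses
# per minute (DV).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sessions = np.arange(1, 21)                          # IV
rate = 2.0 + 0.5 * sessions + rng.normal(0, 1, 20)   # DV with noise

X = sm.add_constant(sessions)   # adds the intercept term
fit = sm.OLS(rate, X).fit()

print(fit.params)      # intercept and slope: DV change per one unit of IV
print(fit.conf_int())  # does the slope's interval contain 0.00?
print(fit.pvalues)
```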
Interpreting nonlinear regression parameters
Well, that was the end of the easy stuff. Now things get a bit more. . .messy. Let’s consider again two common nonlinear models in behavior analysis: discounting and demand (Fig. 6.3). When we fit these models to data, we can interpret the parameters somewhat similarly to above. For every one unit increase in our IV, the DV changes by the amount and in the direction of the estimated free parameter. However, the amount and direction change based on the structure of the nonlinear model, so it’s not quite the easy interpretation we get with linear models. Let’s use the delay discounting equation as an example of the complexity (Eq. 6.4). The structure of this model is called “hyperbolic” because of the type of curve it produces (Fig. 6.3). At small delays, a one unit change in delay (e.g., increasing the delay from one to two weeks) leads to a large reduction in reinforcer value compared to a one unit change at large delays (e.g., increasing the delay from 10 years to 10 years and one week). You can see how interpreting an estimated k value of 0.8 gets messy. How much does reinforcer value change with every one unit increase in delay? Well, it changes by 0.8 hyperbolic units of reinforcer value (in the mathematical sense, not the exaggerated sense). For most, this is not intuitive to “see” without an accompanying graph. But that’s okay, because k is still a precise statistical estimate that allows for quantitative descriptions and predictions of behavior. Thus we can practically use k values for within-subject comparisons of discounting with different commodities (e.g., Baker et al., 2003), between-subject comparisons of discounting in different clinical populations (e.g., Baker et al., 2003), and to predict the likelihood that an intervention will be successful for any one participant (e.g., Dallery & Raiff, 2007; Reed & Martens, 2013).

Returning to a broad generalization, the estimated parameters in nonlinear regression models tell us the amount and direction our DV changes relative to the model’s structure. If we have multiple IVs in our model, then each estimate tells us the relative influence of that IV holding all others constant. Lastly, like all statistical estimates covered in this book, we can practically use that quantitatively precise estimate of the influence of our IV on our DV to describe and predict behavior toward an answer to a research or practice question.

Interpreting classification model parameters
Interpreting the estimated parameters from classification models is also a bit messy technically, but it has a high-level interpretation that is easy to grasp. Let’s start with the easy bit. As described earlier in this chapter, the output of most classification models is a probability that behavior will fall into a specific category. If we have two categories, it is the probability behavior will fall into whatever we defined as a “1.” If we have more than two categories (e.g., one-vs.-rest), then it is the probability that behavior will fall into whatever “one” category we are focusing on. Thus, at a high level, you can think of the estimated parameters in classification models as telling you how much a one unit change in our IV changes the probability that behavior will fall into category “1.” Greater parameter estimates suggest that the IV increases that probability more than lesser parameter estimates. Also similar to above, if we have multiple IVs, the parameter estimates provide you with the relative influence of each IV on your DV holding all others constant. Easy enough, right?

Now the messy part. Kind of like nonlinear regression models, the parameter estimates that your computer spits at you are in units that depend on the type and structure of the model. What this means is that you have to use a specific equation to convert the parameter estimates into the actual change in probability that your IV has on your DV. Fortunately, Google is your friend here. Simply search, “How to convert [model] coefficients into probabilities?” and you will likely find the website you need to help you on your quest. Though this requires a bit of extra work, those who partake will be rewarded with parameter estimates that are much easier to interpret, and you will better understand the precise influence of each IV on the probability of observing your DV.
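For the common case of logistic regression, whose raw coefficients are in log-odds units, the conversion is well known; the sketch below (ours, with hypothetical parameter values) shows both the odds-ratio shortcut and the full probability calculation.

```python
# A sketch (ours) of converting a logistic regression coefficient from
# log-odds units into probabilities. The intercept and coefficient
# values are hypothetical.
import math

def predicted_probability(intercept: float, coef: float, iv_value: float) -> float:
    """Inverse-logit: converts the model's log-odds output to a probability."""
    log_odds = intercept + coef * iv_value
    return 1 / (1 + math.exp(-log_odds))

intercept, coef = -1.5, 0.8   # hypothetical logistic regression estimates
print(math.exp(coef))                             # odds ratio per one unit of the IV
print(predicted_probability(intercept, coef, 1))  # P(category "1") at IV = 1
print(predicted_probability(intercept, coef, 2))  # ...and at IV = 2
```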
Interpreting relative influence in multivariable models
We will close this section on interpreting parameter estimates by discussing how to interpret the relative influence of multiple IVs on your DV. In most multivariable modeling situations, we are interested not just in the size of the influence that each IV has on our DV; we also want to know which IVs are more important or more influential than the other IVs. Doing this requires that we transform all of our IVs onto a similar scale before we fit the model to our data. To demonstrate why, consider a situation where we want to know the precise quantitative influence of reinforcement schedule and reinforcer amount on rate of responding. In this situation, the parameter estimates from our regression model will be in the units of reinforcement schedule and reinforcer amount. Comparing parameter estimates of, say, 2.0 for fixed ratio (FR) requirement changes and 1.0 for the number of tokens is a bit like comparing apples and oranges (well, technically, schedules and tokens). It’s unclear how to easily do that to answer a question about whether schedule requirements or reinforcer amount have a greater influence on response rates. But never fear, smart people throughout history have solved this challenge for us.

The most common method is to use data transforms to get our IVs on the same page. A quick Google search for “methods of feature scaling” will likely return a very large list of ways you can go about doing this. For example, min-max scaling converts your IVs into the range of 0.00–1.00, corresponding to the min and max values of your IV, respectively. And standard scaling (aka converting to a z-score) converts your IVs into their location on a normal curve (see Chapter 3[14]). Such feature scaling is useful because now all our IVs are on the same scale in terms of the sizes they can take. The benefit is that we can now compare the parameter estimates to one another and obtain a precise estimate of how much more of an influence one IV has on our DV compared to another.

[14] A note of caution: This transformation only makes sense if your IV is normally distributed.
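As a quick sketch of what those two transforms do (our code, with hypothetical IV values), scikit-learn exposes both as one-liners.

```python
# A sketch (ours) of the two scaling methods named above, applied to
# hypothetical IV columns, using scikit-learn's preprocessing tools.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical IVs in very different units: FR requirements and token counts.
ivs = np.array([[1, 2], [5, 4], [10, 6], [20, 8]], dtype=float)

print(MinMaxScaler().fit_transform(ivs))    # each column rescaled to 0.00-1.00
print(StandardScaler().fit_transform(ivs))  # each column as z-scores
```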
For example, had we “feature scaled” our IVs in the previous example first, parameter estimates of 2.0 and 1.0 for FR schedule and number of tokens, respectively, would let us know that the FR schedule has twice the influence on rate of responding as the number of tokens we deliver with each reinforcer delivery. As a result, you could use this information to more precisely predict how changing each of these IVs will change behavior.

There is a downside to feature scaling. Because we have converted the units of our IVs out of their original form, our estimated parameters are no longer in the units of our IVs but in the transformed units. As a result, if you want to interpret the parameter estimates as described earlier while also interpreting the relative influence of each parameter, you can always fit the model twice: once with the IVs in their raw, original values and once with the scaled IVs.
Chapter summary
In many research and practice situations, we are likely to find ourselves with a set of data on one or more IVs and one or more measures of behavior (DVs). Although experimentation is the ideal method for determining exactly how one or more IVs influence behavior, it is not always possible or practical. In these situations, we may still want to get a sense of how different experimentally and nonexperimentally controlled variables influence behavior. Modeling helps us do that. The method and approach to modeling we take depends on several factors. These include (1) whether our DV is continuous or discrete; (2) whether we have one or more IVs; and (3) the assumed linear or nonlinear relationship between our IVs and DV. Once we build a model, we evaluate how good it is by analyzing loss metrics. Loss metrics also are chosen based on several factors. These include (1) whether our DV is continuous (regression models) or discrete (classification models) and (2) what we want to emphasize in the analysis (i.e., the units of our DV or model comparisons for regression; specific cell(s) in the confusion matrix for classification). Lastly, assuming we have a model that describes our data well or makes accurate predictions, we can interpret the estimated parameters of our model. Typically, these tell us the relative influence each IV in our model has on our DV. Modeling with the raw IV values allows us to interpret the influence of one unit changes in our IV on our DV, where the DV changes based on the structure of our model (e.g., linearly, hyperbolically, exponentially). However, this approach does not provide us with comparable estimated parameters, as our IVs are often in different units. Modeling after feature scaling solves this problem and allows us to make claims about how much more of an influence one IV has on behavior compared to other IVs.

To close, if you understood everything in this chapter summary, then a sincere and tremendous, “Bravo!” We appreciate that this material is a bit dense and, as mentioned before, modeling deserves a book-long treatment in its own right. Nevertheless, we hope this introduction at least whets your appetite so you can begin your journey toward controlling for nonexperimentally controlled variables and making sense of the beautifully complex world we live in.
References
Baker, F., Johnson, M. W., & Bickel, W. K. (2003). Delay discounting in current and never-before cigarette smokers: Similarities and differences across commodity, sign, and magnitude. Journal of Abnormal Psychology, 112(3), 382–392.
Baum, W. M. (1974). On two types of deviation from the matching law: Bias and undermatching. Journal of the Experimental Analysis of Behavior, 22(1), 231–242.
Borges, A. M., Kuang, J., Milhorn, H., & Yi, R. (2016). An alternative approach to calculating area-under-the-curve (AUC) in delay discounting research. Journal of the Experimental Analysis of Behavior, 106(2), 145–155. Available from https://doi.org/10.1002/jeab.219.
Burnham, K. P., & Anderson, D. R. (2016). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2), 261–304. Available from https://doi.org/10.1177/0049124104268644.
Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21, 6. Available from https://doi.org/10.1186/s12864-019-6413-7.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2020). Applied behavior analysis (3rd ed.). Pearson.
Cox, D. J., & Dallery, J. (2016). Effects of delay and probability combinations on discounting in humans. Behavioural Processes, 131, 15–23. Available from https://doi.org/10.1016/j.beproc.2016.08.002.
Dallery, J., & Raiff, B. R. (2007). Delay discounting predicts cigarette smoking in a laboratory model of abstinence reinforcement. Psychopharmacology, 190, 485–496. Available from https://doi.org/10.1007/s00213-006-0627-5.
Dallery, J., & Soto, P. L. (2013). Quantitative description of environment-behavior relations. In G. J. Madden (Ed.), APA handbook of behavior analysis, Vol. 1: Methods and principles (pp. 219–249). American Psychological Association. Available from https://doi.org/10.1037/13937-010.
Davison, M. (1982). Performance in concurrent variable-interval fixed-ratio schedules. Journal of the Experimental Analysis of Behavior, 37, 81–96.
Davison, M., & McCarthy, D. (1988). The matching law: A research review. Lawrence Erlbaum Associates.
Davison, M. C., & Tustin, R. D. (1978). The relation between the generalized matching law and signal detection theory. Journal of the Experimental Analysis of Behavior, 29(2), 331–336. Available from https://doi.org/10.1901/jeab.1978.29-331.
Grow, L. L., Carr, J. E., Kodak, T. M., Jostad, C. M., & Kisamore, A. N. (2011). A comparison of methods for teaching receptive labeling to children with autism spectrum disorders. Journal of Applied Behavior Analysis, 44(3), 475–498. Available from https://doi.org/10.1901/jaba.2011.44-475.
Hunter, I. W., & Davison, M. (1982). Independence of response force and reinforcement rate on concurrent variable-interval schedule performance. Journal of the Experimental Analysis of Behavior, 37, 183–197.
Hursh, S. R., & Silberberg, A. (2008). Economic demand and essential value. Psychological Review, 115(1), 186–198. Available from https://doi.org/10.1037/0033-295X.115.1.186.
Kahng, S. W., Chung, K. M., Gutshall, K., Pitts, S. C., Kao, J., & Girolami, K. (2010). Consistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 43(1), 35–45. Available from https://doi.org/10.1901/jaba.2010.43-35.
Killeen, P. R. (1994). Mathematical principles of reinforcement. Behavioral and Brain Sciences, 17, 107–172.
Killeen, P. R., & Sitomer, M. T. (2003). MPR. Behavioural Processes, 62, 49–64. Available from https://doi.org/10.1016/S0376-6357(03)00017-2.
Liu, Q., Shepherd, B. E., Li, C., & Harrell, F. E., Jr. (2017). Modeling continuous response variables using ordinal regression. Statistics in Medicine, 36(27), 4316–4335. Available from https://doi.org/10.1002/sim.7433.
Madden, G. J., & Perone, M. (2013). Human sensitivity to concurrent schedules of reinforcement: Effects of observing schedule-correlated stimuli. Journal of the Experimental Analysis of Behavior, 71(3), 303–318. Available from https://doi.org/10.1901/jeab.1999.71-303.
McCarthy, D., & Davison, M. (1980). Independence of sensitivity to relative reinforcement rate and discriminability in signal detection. Journal of the Experimental Analysis of Behavior, 34(3), 273–284. Available from https://doi.org/10.1901/jeab.1980.34-273.
McDowell, J. J. (1989). Two modern developments in matching theory. The Behavior Analyst, 12(2), 153–166.
Merriam-Webster (2022). Model. Retrieved from: https://www.merriam-webster.com/dictionary/model.
Michael, J., Palmer, D. C., & Sundberg, M. L. (2011). The multiple control of verbal behavior. The Analysis of Verbal Behavior, 27(1), 3–22. Available from https://doi.org/10.1007/BF03393089.
Moore, J. (2010). Behaviorism and the stages of scientific activity. The Behavior Analyst, 33(1), 47–63. Available from https://doi.org/10.1007/BF03392203.
Moxley, R. (1982). Graphics for three-term contingencies. The Behavior Analyst, 5(1), 45–51. Available from https://doi.org/10.1007/BF03393139.
Rachlin, H. (2006). Notes on discounting. Journal of the Experimental Analysis of Behavior, 85(3), 425–435. Available from https://doi.org/10.1901/jeab.2006.85-05.
Reed, D. D. (2009). Using Microsoft® Office Excel® 2007 to conduct generalized matching analyses. Journal of Applied Behavior Analysis, 42, 867–875. Available from https://doi.org/10.1901/jaba.2009.42-867.
Reed, D. D., & Martens, B. K. (2013). Temporal discounting predicts student responsiveness to exchange delays in a classroom token system. Journal of Applied Behavior Analysis, 44(1), 1–18. Available from https://doi.org/10.1901/jaba.2011.44-1.
Rodriguez, M. L., & Logue, A. W. (1986). Independence of the amount and delay ratios in the generalized matching law. Animal Learning & Behavior, 14, 29–37. Available from https://doi.org/10.3758/BF03200034.
Schroeder, S. T., Hovell, M. F., Kolody, B., & Elder, J. P. (2004). Use of newsletters to promote environmental political action: An experimental analysis. Journal of Applied Behavior Analysis, 37, 427–429. Available from https://doi.org/10.1901/jaba.2004.37-427.
Skinner, B. F. (1957). Verbal behavior. Appleton-Century-Crofts.
Spiess, A. N., & Neumeyer, N. (2010). An evaluation of R² as an inadequate measure for nonlinear models in pharmacological and biochemical research: A Monte Carlo approach. BMC Pharmacology, 10, 6. Available from http://www.biomedcentral.com/1471-2210/10/6.
Vanderveldt, A., Green, L., & Myerson, J. (2015). Discounting of monetary rewards that are both delayed and probabilistic: Delay and probability combine multiplicatively, not additively. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41(1), 148–162. Available from https://doi.org/10.1037/xlm0000029.
Vollmer, T. R., & Bourret, J. (2000). An application of the matching law to evaluate the allocation of two- and three-point shots by college basketball players. Journal of Applied Behavior Analysis, 33(2), 137–150. Available from https://doi.org/10.1901/jaba.2000.33-137.
CHAPTER 7

How fast can I get to an answer? Sample size, power, and observing behavior

Statistics are a simple utilitarian argument. Just make sure the n’s justify the means.
Introduction
Welcome back to another exciting chapter of your life! By finishing the previous chapter, you are officially over halfway through this book. Well done! In thinking about the landscape of our statistical use of numbers, let’s briefly review where we are and how we got here. To begin, we opened the book by defining statistics as a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data (Merriam-Webster, 2021). Exactly how we aggregate individual data (i.e., create a statistic) depends on the type of data we have and what questions we want to ask and answer (Chapter 2). If we are interested in describing what we have observed, most people use two statistics: one that tells readers/listeners what they can most likely expect under similar conditions (i.e., central tendency; Chapter 3) and one that tells readers/listeners the variability they are most likely to observe (i.e., variation; Chapter 4). This all falls under the label of descriptive statistics. Sometimes we want to not only describe a set of data but also compare two or more sets of data and determine whether they are different. To do this, we often have to make some assumptions so that we can infer whether the differences we observed between two datasets are truly meaningful or may have occurred because behavior is multiply controlled, variation is a rule rather than an exception, and all measurement schemes capture only a subset of all possible observations. Chapter 5 walked through three common ways that people have historically compared two or more datasets: statistical significance, effect sizes, and social significance.
Finally, in Chapter 6, we reviewed what happens when people want to go beyond simply determining whether two or more datasets are different. Specifically, we introduced the umbrella label of quantitative modeling, which has been used extensively by researchers in the experimental analysis of behavior. Though modeling experimental data is very nice for helping us make causal claims, we don’t always have the opportunity to control everything, everywhere, and all at once. Thus we showed how behavior analysts can analyze how many different variables might independently or interactively influence behavior (assuming you collected data on what you were interested in). And we talked about the two most common types of modeling: regression analyses (when your dependent variable [DV] is continuous or ordinal with many levels) and classification modeling (when your DV is categorical).

Critical to all the chapters just reviewed is an important practical question: How much data do I need so that my claims are accurate and stable? One observation certainly is not enough. One trillion observations are likely way more than we need. But how do we draw a line between these two extremes? Knowing when to stop collecting data is critical to how well we can claim that our measures of central tendency and variation are accurate and unlikely to change much with more data. Answering this is also critical to claims that our measures of statistical significance, effect size, or social significance are accurate and stable. And answering this is critical to knowing that our models accurately characterize the relationship between independent variables (IVs) and DVs. The purpose of this chapter is to highlight how people have answered these questions in the past.
Enough is enough
As with everything to this point in the book, the question “How many observations are needed?” is not new. Scientists have wrestled with it for probably as long as they have been collecting data to help them answer their questions about the natural world. Essentially, the question boils down to, “How do we know that we have enough observations to stop?” There are at least two ways we can frame this question–answer combination.
The first way to frame this question–answer combo relates to descriptive statistics or the visual analysis of intervention effects. Generally, the answer here is to continue to collect data until you have a repeatable result within a tolerable range of variability. We want to be pretty confident that we’ve measured well the thing we’re measuring, be it a description of session responses per minute or the “true” effect of baseline or intervention contingencies on behavior. Stated differently, we know we can stop when the answer to our question will change little as we continue to burn through valuable resources adding more observations. This sounds nice and straightforward but, to paraphrase Heraclitus, “The only constant in life is change.” We can always expect some amount of variability between observations in our measures of the environment and behavior. So a related question becomes, “How much variability is expected such that we’re fairly certain what we’re measuring is unlikely to change if more data were collected?” Succinctly, is our measure of central tendency changing little between observations?

The second way to frame this question–answer combo relates to situations where we don’t see much of a difference between two or more datasets. Here the question becomes whether we have collected enough data such that we would have detected a difference if one truly exists. Answering these kinds of questions is likely most relevant to inferential statistics and a fun set of equations referred to as power analyses (more on these below). However, it’s also likely that someone relying solely on visual analysis might find themselves asking this question. And it’s always a good habit to think proactively about what pattern in your data you would need to see to state, “I’m confident that this didn’t work and I need to try something new.”

Different people with different educational and experiential backgrounds are likely to answer these questions differently. And it’s unlikely that any single answer will be the best for all situations for all people across all time. As such, this chapter highlights how different people might answer these questions as we scale our decisions of when to stop data collection from the trial level through group design research; that is, as we scale from N = number of responses through N = number of participants. Throughout, we’ll do our best to interweave answers from the visual and statistical camps. But significant gaps still exist in this arena, and Jason and David certainly don’t have all the answers in life. So we’ll stop our introduction here, assuming we have provided a large enough contextual sample for you to understand the importance of this chapter.
Observations = responses
Most behavior analysts likely collect their data at the level of an operant response. We could extend our observations down to the moment-by-moment musculoskeletal and neural patterns of activity that aggregate to what we tact as “pointing to the color green” or “aggression.” However, rarely do we get that molecular in our data collection. Rather, behavior analysts most often stick to the “functional units” defined by a three-term contingency of some kind, such as: “point to the color green.” (SD) → [any response that involves the person touching a green card] (behavior) → [delivery of a putative reinforcer] (SR+); or “point to the color green.” (SD) → [any alternative response to aggression] (behavior) → [delivery of a putative reinforcer] (SR+).

As noted throughout the book, behavior analysts rarely use the 1s and 0s data that are collected in the above learning trials (0 = target behavior did not occur; 1 = target behavior occurred). That is, though we collect data on a trial-by-trial or moment-by-moment basis on whether a response occurred (or is occurring, in the case of duration measures), we typically aggregate data from many observations to create a single datum that we then plot on a graph. These aggregated data points might be the percentage of trials with a correct response, the average number of aggressive responses per minute during the session, or the average duration of time-on-task in an education setting. The question this section asks is, “When you plot that data point, how do you know it accurately represents the individual’s ‘true’ ability?” Though this may sound absurd, consider Fig. 7.1, where different within-session patterns of behavior lead to identical data points being plotted when aggregated at the session level.

Figure 7.1 Demonstration of how different patterns of responding (data on the left) aggregate away important patterns of responding and lead to identical graphs.

Let’s look first at the top set of data showing the percentage of trials wherein a hypothetical client emitted the correct response. The five sessions from Client A correspond with one of the two graphs in the top panel and the five sessions from Client B correspond with the other. But which is which (we’ll save you the time and let you know they are perfectly identical graphs)? If you look closely at Client A’s data, you’ll notice that they always got the first 6–8 trials correct before getting the rest incorrect. In contrast, Client B responded correctly and incorrectly more evenly throughout all 10 trials. So are the data in the graphs representative of these clients’ true abilities?

Let’s start with Client A. The statistic being shown is the percentage of trials on which they emitted the correct response. We’re probably interested in whether they “know” the thing we are trying to teach (e.g., color tacts, matching identical objects). As a representation of what they “know,” percentage correct is a poor statistic for Client A. How do we know? Because, based on these patterns, we’d plot 100% correct if we ran five trials and somewhere between 30% and 40% if we ran 20 trials. Our decision about when to stop collecting data influences our measure of Client A’s “ability.” And, if a measure of someone’s “ability” depends on our decision about how long we measure it, then it’s likely not a very accurate or stable measure. In contrast, Client B’s pattern of responding is much more evenly distributed across the 10 trials. If we stopped at five trials or extended the trials out to 20, our graphs would likely look just about the same. That is, the statistic we are using appears stable as a function of the number of responses included in our aggregation. Thus our measure of whether Client B “knows” the thing we are trying to teach does not seem to depend on how long we measure it. So, for Client B, graphing the percentage of trials with correct responding appears to accurately represent their behavior.
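One quick way to see this instability is to recompute the aggregate statistic at every possible stopping point. The sketch below is ours; the trial-level 1s and 0s are hypothetical stand-ins for patterns like Client A’s and Client B’s.

```python
# A sketch (ours) of recomputing percentage correct at each possible
# stopping point for two hypothetical trial-level patterns. Pattern A
# front-loads correct responses; Pattern B spreads them out. Both end
# at 60% after 10 trials.
import numpy as np

pattern_a = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # hypothetical Client A-like data
pattern_b = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1])  # hypothetical Client B-like data

for name, trials in [("A", pattern_a), ("B", pattern_b)]:
    running = np.cumsum(trials) / np.arange(1, len(trials) + 1) * 100
    print(name, np.round(running).astype(int))
# Pattern A swings from 100% down to 60%; Pattern B hovers near 60% throughout.
```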
The bottom sets of data in Fig. 7.1 show the same idea but based on the statistic of responses emitted per minute. Again, we have identical graphs but very different patterns of responding across the 10-minute sessions being used to create the graphs. For Client C, we have near zero instances of the response except for a single minute where all the action happens. Compare that to the pattern of responding from Client D, where a similar number of responses occur per minute throughout the entire 10-minute session. Similar to the comparisons above, if we had arbitrarily stopped the session at 5 minutes or extended it to 20 minutes, our measure of responding would change for Client C but would likely remain the same for Client D. Thus our statistic is likely not a great representation of behavior for Client C but is likely a good representation of behavior for Client D. Our measure of behavior shouldn’t depend on the arbitrary choices we make.

You might be asking yourself, why does this matter? Well, glad you asked! Remember that we rarely collect data just to have piles of data sitting around. We collect data to understand how the environmental conditions we have created allow us to describe, predict, and control responding. And we collect data to understand how we can help someone’s life be even better by modifying the environmental conditions of which behavior is a function. Consider the skill acquisition data from Fig. 7.1. The response patterns observed from Client A suggest our next steps might be to manipulate some aspect of the contingency (establishing operation, putative reinforcer) to promote the desired performance for a greater number of trials; to reflect on whether it matters if they can emit the correct response more than five times in a single session[1]; or maybe something else. In contrast, the response patterns observed from Client B suggest our next steps might be to conduct an analysis to evaluate undesirable sources of stimulus control (e.g., position or stimulus biases); to revisit our prompt type, prompt fading approach, or error correction procedure; or maybe something else.

[1] Of note, many decisions in life are one-shot decisions wherein we make a single decision with a fair amount of time before a similar decision comes along (e.g., Guo & Li, 2014; Guo, 2011).

There’s a more general point here. Aggregating many responses into a single statistic (e.g., percentage of trials with correct responding, responses per minute) necessarily removes information about the pattern—the distribution—of responses at smaller timescales. And, just as was discussed in Chapters 2–4, the accuracy of the statistic you have chosen depends on the distribution of responses at that smaller timescale. If you would get a similar answer, within some tolerable range of variability, regardless of how much time your observation spans or how many trials you run, then the statistic is likely a representative measure of behavior and we can stop collecting data (Clients B and D in Fig. 7.1). If you would get a different answer, outside your tolerable range of variability, depending on how much time your observation spans or how many trials you run, then the statistic is likely not a representative measure of behavior and we need to continue collecting data or change something up (Clients A and C in Fig. 7.1). Importantly, the clinical decisions you make might differ based on those within-session patterns of responding.

How have people handled this, historically? Unfortunately, there are not a lot of great examples here[2]. Most published datasets in behavior analysis simply show us the statistical aggregation that captures the totality of all responding within each session. We haven’t read a published article where the researchers discussed or presented evidence that the statistical measure they chose to represent responding within each session was, in fact, an accurate measure of responding. Our assumption is that researchers have simply ensured the data are truly representative of the underlying patterns or else didn’t include those participants within the analysis. But who knows.

[2] Some researchers have asked seemingly related questions around continuous data collection versus time sampling (e.g., Meany-Daboul et al., 2007); others have compared decisions around acquisition looking at first-trial performance versus performance across all trials in a session (e.g., Lerman et al., 2011); and still others have asked whether global measures of treatment integrity were representative of performance on individual steps within a treatment protocol (e.g., Cook et al., 2015). However, this is different from what we’re talking about here. In each of these examples, the question revolves around the efficiency of a decision being made (e.g., does the individual know the thing we’re teaching?). As a generalized question: “Do more efficient within-session data collection or analysis procedures allow us to make functionally similar decisions about within-condition response patterns?” But, in this section of this chapter, we’re asking how well our aggregate measure of within-session responding reflects behavior from within the same session. A binary data distribution (e.g., first-trial data collection) can only be an accurate representation of a continuous or ordinal data distribution (e.g., 10-trial data collection) if the individual gets 100% or 0% correct. Anything in between will necessarily differ. So, statistically, they would not be equivalent. But, when we think about how we’re using the data at a higher level (within-condition analyses), they might be functionally equivalent.
Figure 7.2 Analyzing within-session responding to determine whether an aggregated session statistic is representative of responding. Can you guess which client belongs to which graph?

Fig. 7.2 shows one method whereby researchers or practitioners could run a gut check around this issue[3]. The graphs use the data from the first two sessions of each client in Fig. 7.1 (sessions 1 and 2 are shown as different markers and data paths). However, rather than showing the aggregate session statistic, we have plotted what the statistic would have been had we stopped collecting data at that trial number (top panels) or at that minute into the session (bottom panels). Plotting within-session response dynamics in this way makes it easy to visually and statistically determine whether the session statistic is not (left panels) or is (right panels) a good description of responding for that client in that session.

[3] Many researchers have examined within-session response dynamics (e.g., Banna & Newland, 2009; DeMarse et al., 1999; McSweeney & Swindell, 1999; Reilly & Lattal, 2004; Tonneau et al., 2006). We are certainly not the first to look at data in this way. Though we are unaware of researchers who regularly use this approach to make claims about whether the session-level statistics they choose to use are valid representations of data from a session.
Table 7.1 Primary function of asking how many observations we need, based on the level of observation we are considering, and common methods for identifying an answer.

Observation level: Response
Function of answering, “When should we stop?”: Is my session level description of responding accurate?
Common solutions: • Within-session response analyses (statistical and visual analysis of trend, level, variability). • Select central tendency, variability, and stability measures to match the data type and session data response distribution.

Observation level: Session
Function of answering, “When should we stop?”: Is my condition level description of responding accurate?
Common solutions: • Within-condition response analyses (statistical and visual analysis of trend, level, variability). • Select central tendency, variability, and stability measures to match the data type and condition data response distribution.

Observation level: Participants
Function of answering, “When should we stop?”: Am I accurately describing all potential participant characteristics that might impede someone else from getting the same (or bigger) effect sizes with future clients?
Common solutions: • Robust pre- and post-intervention assessment procedures to capture developmental and behavioral abilities. • Robust demographic and social determinants of health measurement systems to capture larger social and economic variables that might impact behavior. • Publishing these data alongside your research paper.
To summarize, behavior analysts collect a lot of data every day on how someone responds within each session. Often, we may simply plot the aggregate statistic we commonly see in the pages of the Journal of Applied Behavior Analysis, or the one our training favors, without checking whether that aggregate statistic is an accurate and stable measure of that specific client’s behavior, or whether we need to collect more data. By plotting within-session patterns of responding, we can answer the question central to when observations = responses: “Is my session level description of responding accurate?” (Table 7.1). To answer this question using within-session plots of responding, we can simply use the visual analysis and statistical tools we are familiar with when analyzing behavior across sessions.
Observations = sessions
Behavior analysts are likely most familiar with analyzing the number of observations as defined by the number of sessions. Here, the question of interest often is, “How many sessions do I need to conduct to accurately capture the effect of my intervention on behavior?” (Table 7.1). From the visual analytic camp, the answer is short and sweet. You can stop collecting data within a condition “once responding is stable” or, as described by Sidman (1960), when a “steady state” is achieved. That is, when we consider the trend to be near zero, and the level and variability (aka “bounce”) of responding to be consistent and near constant (Cooper et al., 2019)[4]. The logic involves three steps. First, you control experimental and extraneous variables well enough to provide a context for stable responding to occur. Second, you continue to collect data until stable responding is achieved and recognized. Third, once stable responding is achieved and recognized, you move to the next condition, consider the target response mastered, end the functional analysis, or do the next voodoo that you do so well.

[4] For the few readers interested in behavioral dynamics as opposed to responding only once stable, the criterion would likely be similar. The broader, generalized answer is to collect data until the phenomenon you’re after has been adequately measured. If you’ve reached stable responding, the “dynamics” part is likely over and you have what you’ll have for your question at hand.

But what does “stable responding” actually mean? Perhaps we can lean on the famous definition of hardcore pornography proffered by Justice Stewart in 1964: “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description, and perhaps I could never succeed in intelligibly doing so. But I know it when I see it” (Jacobellis v. Ohio, 1964). That is, maybe a definition of “stable responding” is superfluous. A “good” or “competent” behavior analyst simply can tell when responding is stable vs. when it is not. Easy peasy, end of story. But visual analysis is unlikely to be as reliable, valid, or precise a measure as we might need in some situations. After all, past researchers have repeatedly shown that reasonable people disagree about how to interpret functional analyses (e.g., Danov & Symons, 2008; Saini et al., 2018), whether there was an intervention effect (e.g., DeProspero & Cohen, 1979; Ford et al., 2020; Kahng et al., 2010; Matyas & Greenwood, 1990), and when to continue an intervention vs. change something about an intervention program (e.g., Cox & Brodhead, 2021; Saini et al., 2018). Why would decisions about “response stability” be any different?

Researchers working in the experimental analysis of behavior, however, have worked around such lack of specificity in “response stability” by offering explicit statistical definitions or rules (aka stability criteria) to guide their decision-making. For example, Sidley and Schoenfeld (1964) suggested that 5-day blocks of responding, summarized via the arithmetic mean and 95% confidence intervals, could be used to evaluate a stability criterion.
As a second example, Cumming and Schoenfeld (1959) automatically threw out the data from the first seven days and, using data from the next six days, compared the mean of the first three days to that of the last three days, continuing to add days until the means (Chapter 6) were within 5% of each other. As a more recent example, Carlson and Weiner (2023) use a binomial test of significance to identify whether responding is better than chance to suggest acquisition has occurred. For examples of other stability criteria and related discussion, inquisitive readers are directed to Perone (1991) and Sidman (1960).

The point here isn’t that any one statistical definition is better than another. In fact, how we choose to define stability may end up influencing how many sessions we have to run until we get an “accurate” measure of responding (e.g., Killeen, 1978). Rather, the point is that adding a statistical definition of response stability creates a replicable method to claim that responding has reached stability, one that anyone, anywhere, at any point in time can use to arrive at the same answer. For the sake of this section of this chapter in this book, what’s most important is to explicitly define what your stability criterion is and then to stick to it. And, at a more general level, the direct answer to how “best” to define stability criteria follows from everything else in the book to this point: the stability criterion should be defined such that your measure of central tendency and variability within a condition accurately describes your data and is consistent within a tolerable range of variability (Table 7.1).

So, how does one go about turning that general answer into an explicitly defined stability criterion? This is a great question and, probably no surprises here, the answer depends on the environment-response relations you’re measuring, how accurate you want to be, and the cost trade-off in the number of sessions you can afford to run before you need to make a decision and take action. For example, behavior analysts working in applied settings have less environmental control, and the latency to identify and implement successful interventions is critical. For them, stability criteria that allow greater variability and a greater degree of trend might be okay. In contrast, researchers working in laboratory settings with greater environmental control and arbitrary responses can likely afford stricter stability criteria and a greater number of sessions. In any event, given how seldom strict criteria are published in the literature, our current recommendation is simply for behavior analysts to begin making explicit their quantitative definition of response stability so we can begin to identify what criteria are best depending on the context, variables under study, and goals of the researcher or practitioner.
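As a rough illustration of what an explicit criterion can look like in practice, the sketch below (our own, not a published implementation) encodes a Cumming and Schoenfeld (1959)-style rule on a hypothetical series of session means; the 5% comparison is one reading of “within 5% of each other.”

```python
# A sketch (ours) of a Cumming & Schoenfeld (1959)-style stability check
# applied to hypothetical session means. This is one reading of the rule
# described in the text, not a published implementation.
import numpy as np

def is_stable(session_means, discard=7, window=6, tolerance=0.05):
    """True when the first-3 and last-3 means of the latest window agree within tolerance."""
    data = np.asarray(session_means, dtype=float)[discard:]  # throw out early sessions
    if len(data) < window:
        return False  # not enough sessions yet; keep collecting
    recent = data[-window:]
    first, last = recent[:3].mean(), recent[-3:].mean()
    return abs(first - last) / ((first + last) / 2) <= tolerance

rates = [12, 30, 22, 28, 24, 26, 25, 24.5, 25.5, 25, 24.8, 25.2, 25.1]
print(is_stable(rates))  # True once responding has settled
```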
Observations = participants
As we continue to scale our focus outward, the final level of observations commonly discussed for experimental work is when the number of observations equates to the number of participants in a study. It is likely this level of observation that most people immediately think of when the topic of “sample size” is raised. We purposefully left it for the end of this chapter because of the chain of decisions that need to be made before you can even get to this question. Specifically, you first have to decide how you will ensure your aggregate measure of within-session responding accurately describes the behavior of interest and, ideally, what statistical cutoff makes you feel comfortable that it does. Second, you have to decide how you will ensure your aggregate measure of within-condition responding or between-group responding accurately describes the behavior of interest and, ideally, what statistical cutoff makes you feel comfortable that responding is “stable.” Lastly, once you have pinned down both of those decisions in a defensible way, you get to decide how many participants you need to recruit[5].

[5] NB: This section is likely irrelevant to those who only work in practice settings. However, the content might be worthwhile to read for when you consume and evaluate publications where these questions are relevant, or if you start to engage in modeling of patient outcomes or organizational processes a la Chapter 6.

How we answer this question also raises a historical issue prominent in the wars of psychological study and what are claimed to be the “best” research practices. If you’re reading this book, our hunch is that you likely have a good handle on the differences between within-subject research designs (sometimes called single-case experimental designs, SCEDs) and group-subject research designs. Since you’ve made it this far, we also have a hunch that you could guess Jason and David believe the supposed historical rift is a false dichotomy, fueled more by rhetoric and chest thumping than logic and fact. Ironically, almost all researchers that we chat with feel similarly. So, perhaps, someday this false dichotomy will find its way out of the pages of our textbooks.

In any event, SCEDs and group design methodologies each have their strengths and weaknesses (Cox, 2016; Epstein & Dallery, 2022; Fisher et al., 2011; Roane et al., 2011). SCEDs often have greater internal validity than group designs. This provides the researcher with greater confidence that they have correctly identified the environmental variables that lead to the behavior observed for the participants in their study. However, multiple control of behavior is the rule, and every human has a unique developmental[6] and learning history when entering an intervention. Thus SCEDs by themselves are often less well equipped to make claims about how the results are likely to generalize to other individuals within the larger population[7]. That is, SCEDs often have low external validity[8]. Here we’ll quickly note that science is a collective phenomenon. External validity from SCEDs can be enhanced and accumulated via systematic replications over time (Walker & Carr, 2021).

[6] All humans are biological organisms. For biological reasons, we all somehow “know” how to grow arms and legs and lungs and tongues. We “know” how to temper the growth of these things once they reach a certain relative relation to our larger body (e.g., some people don’t accidentally grow ears the size of their legs). And we know when to stop growing these things altogether once our body reaches a certain overall size (our adult size). Further, unless we posit that the “controller” of our repertoires and behavior exists outside the physical realm, then “learning” necessarily involves some kind of change in our biology which—as we noted—is subject to continuous developmental changes throughout our life span (e.g., we continue to age once we reach our adult size). And, lastly, we also often work with individuals who have unique developmental patterns such that their rate of learning and overall repertoires are different from those not requiring our support. Thus, though the intersection between biological development, aging, and behavior analysis is understudied and underdiscussed, the logical interaction between these two areas cannot be ignored when considering the external validity of our research.

[7] NB: We’re not arguing that the same behavioral processes likely wouldn’t be at play with all members of a population. But the size of the effect is likely going to be different based on each person’s unique learning history and current behavioral repertoire. Differential effect sizes are likely not relevant for individuals in the population for whom the effect size is larger than in the SCED study. However, for those for whom the effect size reduces below a statistically or socially significant level, SCEDs often don’t provide the information that readily allows you to know what variables are the ones interacting and impeding the intervention effect. And a fair question for readers of research is whether my client(s) will fall in the “no effect” category or the “larger effect” category. This is a nontrivial question when we want to use the tens of thousands of dollars a year and 30+ hours per week of someone’s time to implement a suite of interventions.

[8] Behavior analysts commonly use the terms generality (not to be confused with Baer et al., 1968’s usage of the same term) and generalization (not to be confused with stimulus or response generalization) in lieu of the term external validity (Branch & Pennypacker, 2013; Johnston et al., 2019).
In contrast, group designs often have much greater external validity than SCEDs. This provides the researcher with much greater confidence about the magnitude and variability of the intervention effect that individuals in the larger population are likely to experience. Further, if done well, these studies can also highlight the unique learning and developmental histories that coincide with differential effects of the intervention. However, group designs often collect only one or a few observations from many participants. Thus group designs are often less well equipped to make claims about the behavioral processes and mechanisms controlling behavior.
This paragraph has a very important point you shouldn’t skip over[9]. In both the previous paragraphs, we used the word “often” when describing the strengths and weaknesses of SCEDs and group design studies. This is because there is nothing inherent to SCEDs or group design studies that automatically makes them more internally or externally valid than all studies conducted using the other approach. Internal validity in behavior analytic research involves collecting enough observations on behavior-environment relations to demonstrate a likely[10] causal relation. For our purposes, external validity requires you to collect data from enough participants to have sufficiently sampled the many different characteristics likely to multiply control behavior. It’s often practical constraints that stop someone from running 1000 participants through a within-subjects research design. And, it’s often practical constraints that stop someone from collecting repeated measures in group design studies. Stated differently, your choice of research design does not require internal and external validity to be mutually exclusive ideas.
You might be asking, “What is the relevance of all this reflection on internal and external validity?” In short, the reason we’re spending time here is because of the function of asking how many participants we need in the first place (Table 7.1).
[9] What a great topic sentence, eh!?
[10] Often left out of these conversations is the logical fact that no SCED can technically ever demonstrate a causal relation. Until we can measure everything, everywhere, and all at once, each within-subject replication only reduces the likelihood that confounds influenced our results. This likelihood gets nearer and nearer zero the more observations we collect within each condition and the more replications we run. But it is still impossible to ever be certain that we have removed the possibility of any confounds.
If we are asking questions around how many participants (observations) we need, then we are likely in the land of caring about external validity. Questions around external validity can come in many different flavors, though there are two common questions researchers often ask of group designs. And, in both situations, you will likely need to conduct a power analysis to determine how many participants are needed based on the analytic strategy you have chosen to use in your study. A power analysis is a standardized method for calculating the probability that you will measure an intervention effect if one actually exists. The logic behind the calculations relies on everything we have discussed to this point in the book with respect to probability distributions and data types, how we can best describe our data, and how we choose to measure a meaningful effect via statistical significance, effect size, or social significance. Fortunately, many freely available software packages do the heavy lifting for you. Our favorite is G*Power (Faul et al., 2007, 2009). To use it, all you need to do is have some specific information handy and enter it into the software interface, and—voila—it spits out how many participants are needed to ensure you detect an effect with the chosen probability (see Fig. 7.3 for example outputs for within-subject and group design methodologies). Exactly what information you need to have handy depends on which of the two common questions you are interested in asking about relative to external validity.
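Before moving to those questions, it can help to see what “power” means operationally. Below is a minimal simulation sketch (all values are assumed for illustration, not taken from Fig. 7.3): we generate many hypothetical two-group studies in which a true effect exists, then count how often a t test detects it. That proportion is the power.

```python
# Power as a simulated probability (illustrative values only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group = 10   # assumed sample size per group
true_d = 1.0       # assumed true standardized effect
alpha = 0.05
detections = 0
n_sims = 10_000

for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)           # no-effect condition
    intervention = rng.normal(true_d, 1.0, n_per_group)   # shifted by the true effect
    if stats.ttest_ind(intervention, control).pvalue < alpha:
        detections += 1

print(detections / n_sims)  # proportion of simulated studies detecting the effect
```

Power analysis software simply runs this logic analytically: given an effect size, alpha, and design, it returns the sample size at which this proportion reaches your chosen value.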
Is the intervention effect I’m seeing real?
The first common question around external validity is whether the results we get are real or might have occurred randomly with the people who happened to enroll in our study. Framing the question slightly differently, we often want to know how many participants we would need to recruit to detect an intervention effect of the size we’re interested in, if that effect actually exists. Remember, the goal of this is external validity. So after running the calculated number of participants, we are either on to something (we observed an effect as large or larger than desired) or we likely need to rethink what we’re doing (we observed an effect smaller than desired or no effect at all). To actually make these calculations, the number of participants (i.e., observations) needed to answer this question depends on how we choose to measure our intervention effect. If you recall back to
Figure 7.3 Example G*Power outputs for a within-subject study (top panel) and group design study (bottom panel).
Chapter 5, this involves making some decisions. We need to identify which statistical test or measure of effect size we want to use. Or, if we’re using a measure of social significance, we will want to turn that into an effect size. Relatedly, we also need to identify the alpha value we are comfortable with (often p = 0.05), the probability of detecting
an effect (often β = 0.20, so power = 0.80), and, depending on the statistical test, other parameters that might be needed to calculate the statistics (e.g., sample size differences for Pearson’s r, number of covariates for analysis of covariance [ANCOVA]). Once entered, the software package will return how many participants are needed to detect an effect—if one exists—based on the likely effect size chosen and the desired power.
There are two important things to note about the power analyses in Fig. 7.3. First, before a study is even conducted, notice the amount of information you need relative to predicted patterns of responding during baseline and intervention. Power analyses (and statistics in general) use a much greater level of precision when describing and predicting behavior than behavior analysts are used to. Rarely are behavior analysts required to make precise predictions before an intervention as to the exact level toward which behavior will increase or decrease as a function of an intervention. Though we have no data to support this claim, we suspect that thinking through behavior-environment relations with an eye toward such precision will make us better behavior analysts in the long run. Second, notice how switching from a within-subject to a between-group design increased the total number of participants by only a trivial amount, from four to six. This highlights what Jason and David have been saying throughout the book. The effect sizes we see in behavior analysis are often huuuuuuge. With just a tiny bit of extra work via conducting a power analysis, running a few extra participants, and adding statistics to manuscripts, behavior analysts can speak to those outside our field in a language they are familiar with and disseminate our goods and wares more broadly and to greater effect.
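For readers who prefer a scriptable alternative to G*Power’s point-and-click interface, the same calculations can be run in a few lines of Python with the statsmodels package. The sketch below uses an assumed effect size of d = 2.0 for illustration; these are not the exact inputs behind Fig. 7.3.

```python
# Power-analysis sketch with statsmodels (assumed illustrative values).
from statsmodels.stats.power import TTestPower, TTestIndPower

d = 2.0        # assumed standardized effect size (Cohen's d)
alpha = 0.05   # Type I error rate
power = 0.80   # 1 - beta

# Within-subject (paired t test on difference scores): total participants
n_within = TTestPower().solve_power(effect_size=d, alpha=alpha, power=power)

# Between-group (independent-samples t test): participants needed per group
n_between = TTestIndPower().solve_power(effect_size=d, alpha=alpha,
                                        power=power, ratio=1.0)

print(f"Within-subject design: {n_within:.1f} participants")
print(f"Between-group design: {n_between:.1f} participants per group")
```

As with G*Power, rounding the returned values up gives the smallest whole number of participants that achieves the desired power.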
How do I know if I’ve accounted for the variables multiply controlling behavior?
The second common question around external validity is how different characteristics of the participants might influence, or allow us to describe, the relation among those multiple characteristics, our intervention, and the intervention effect. Here, we are likely using some kind of multivariable modeling approach (Chapter 6). Fortunately,
many of the same power analysis software packages will also tell you how many participants are needed when using a regression or classification modeling approach. The only difference from the previous paragraph is the type of information you need to come ready with. As you likely guessed based on Chapter 6, you would need information such as how many IVs are going into the model, the relationship between those variables, and the data types and distributions of those IVs. But, in the end, you plug and chug your information into the software interface to get the answer you need to the question: How many participants do I need to detect a differential influence of my IVs on behavior if I take a modeling approach to my analyses?
One important consideration for these kinds of external validity questions is that the number of participants begins to increase quite substantially as we add more variables to our model. For example, if we were to conduct a multivariable linear regression with five predictor variables (e.g., baseline assessment scores and demographics) and a Cohen’s f² effect size of 0.3, we would need a sample size of 49. The sample size jumps to 64 participants if we include 10 predictor variables and 87 participants with 20 predictor variables. Given the resource constraints most SCED researchers are under, attaining this kind of external validity is often not feasible, at least in any single study. Further, simply increasing sample size does not automatically equate to improved external validity.
The sample size challenge of modeling multiple predictor variables on intervention effect raises an important point. We know multiple control is the rule rather than the exception. We can’t simply ignore the fact that many things are likely interacting to influence the size of our intervention effect from client to client. Thus the impracticalities of recruiting large sample sizes for SCED research underscore the importance of researchers publishing client characteristics in their manuscripts. Past researchers have consistently observed that demographic variables are often underreported in behavior analytic journals across specific populations and specific interventions (e.g., Jones et al., 2020). By simply adding these variables to manuscripts, over time we can amass larger sample sizes with information about potentially relevant developmental and behavioral client characteristics that influence the direction and magnitude of our interventions. In turn, researchers can use this information to predict and improve client outcomes proactively and at scale (Cox et al., 2023).
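Those sample sizes can also be checked without specialized software. The sketch below iterates over N using the noncentral F distribution that underlies the test of R² deviating from zero; it assumes G*Power’s convention for the noncentrality parameter (λ = f² × N) and is offered as an illustration, not a replacement for the published tools.

```python
# Sketch: smallest N for a multiple regression F test (assumes lambda = f^2 * N).
from scipy import stats

def n_for_regression(f2, n_predictors, alpha=0.05, target_power=0.80):
    n = n_predictors + 2                      # smallest N with a valid denominator df
    while True:
        u = n_predictors                      # numerator degrees of freedom
        v = n - n_predictors - 1              # denominator degrees of freedom
        lam = f2 * n                          # noncentrality parameter
        f_crit = stats.f.ppf(1 - alpha, u, v)
        achieved = 1 - stats.ncf.cdf(f_crit, u, v, lam)
        if achieved >= target_power:
            return n
        n += 1

for k in (5, 10, 20):
    print(k, "predictors ->", n_for_regression(0.3, k), "participants")
```

Run with f² = 0.3, this loop should land at or very near the 49, 64, and 87 participants quoted above.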
When to decide when to stop: Variations on a theme
So far, we have jumped around a bit in terms of when you make the decision to stop collecting data. We presented these decisions primarily around how they are likely to unfold in day-to-day practice and research (Table 7.1). When observations = responses (i.e., data collection within a session), we described decisions in a post-hoc manner. That is, you would likely run analyses after the session is over to make sure you plot an accurate statistical measure on your graph (e.g., median, arithmetic mean, mode)[11]. When observations = sessions (i.e., data collected within a condition), we also described post-hoc decisions. That is, you conduct a session, plot the new (and accurate!) statistical aggregate of responding, then run analyses to see if your data meet your explicitly defined criteria for “stable responding,” then stop or continue running sessions based on the results. Lastly, when observations = participants (i.e., data collected within a study), we described decisions in an a priori manner. That is, you would run a power analysis to determine how many participants are likely needed based on the methodology and analytic strategy you plan to use.
But, you don’t have to think about these analyses only in the manner described previously. For example, you could conduct power analyses to determine, a priori, how many responses are needed within a session or within a condition based on previous within-session response patterns common to baseline and intervention. Similarly, maybe you can only afford to recruit a certain number of participants and want to know what the alpha and power would be under those conditions. Here, a compromise power analysis is your friend. Or, perhaps, after you have run a study you want to know what your actual alpha and power are. Here, you can use a criterion power analysis or a post-hoc power analysis to get the information you need[12]. And, finally, sensitivity power analyses tell you the smallest effect size your number of observations can actually detect, which could technically be run a priori; in the middle of a session, condition, or study; or after data collection has stopped.
[11] Relatedly and where possible, we also hope the content in this chapter leads behavior analysts to begin to include “error bars” on each marker showing the variability of responding within each session. As readers, we know it’s there. Why not help us better understand exactly what responding looked like in that session?
[12] A criterion power analysis is an alternative to a post-hoc power analysis when controlling beta is more important than controlling alpha. But, both are technically done after the study has been completed.
Again, fortunately for us, software packages do the heavy calculation work for us. You just have to know the data type, data distribution, and the analyses you plan to run based on whether your observations are responses, sessions, or participants. After that, you can ask these questions whenever you want.
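As a hedged illustration of that flexibility, the sketch below uses statsmodels to answer two of these after-the-fact questions for a hypothetical study with eight participants per group; the sample size and effect size are assumed values, and compromise analyses (trading alpha off against beta) remain easiest to run in G*Power itself.

```python
# Post-hoc and sensitivity power analyses (assumed illustrative values).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = 8   # hypothetical: all we could afford to recruit
alpha = 0.05

# Post-hoc style: what power did we have for an observed effect of d = 1.2?
achieved_power = analysis.power(effect_size=1.2, nobs1=n_per_group, alpha=alpha)

# Sensitivity: the smallest effect detectable at 80% power with this N
smallest_d = analysis.solve_power(nobs1=n_per_group, alpha=alpha, power=0.80)

print(f"Achieved power: {achieved_power:.2f}")
print(f"Smallest detectable effect: d = {smallest_d:.2f}")
```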
Chapter summary
Data collection in behavior analytic research and practice involves a trade-off between accuracy and resources. We want to collect enough data that we get an accurate picture of behavior-environment relations, but we don’t want to waste time and resources collecting “more than is necessary” just for kicks-and-giggles. A fair reflection then becomes: when is enough data collection enough? That is, how do we know we have adequately balanced accuracy and resource use? In this chapter, we reviewed when and how behavior analysts might make these decisions when observation is at the level of responses, sessions, and participants. And, we reviewed how our strategy likely depends on the data type, data distribution, the analyses we will conduct, and whether we’re focusing on the internal or external validity of our data-based claims. Finally, we closed by highlighting that the statistical tools and techniques to answer this question can be used whenever we want an answer, be it a priori, in medio, or post-hoc (i.e., before, during, or after a study).
To round out this chapter, we’d like to reorient ourselves within the statistical landscape in which we are hiking. We began our journey by describing what statistics are and noting that behavior analysts use statistics on a daily basis even if they don’t call them that (Chapter 1). We then described what data types and data distributions are (Chapter 2) and how these influence our measures of central tendency (Chapter 3) and variability (Chapter 4). Often, behavior analysts are interested in using this information to compare behavior across two or more different conditions, which can be accomplished—statistically—in many different ways (Chapter 5). Lastly, sometimes we have many different levels of our IV or many different IVs and we want to precisely describe or predict the relation between those IVs and behavior (Chapter 6).
Accuracy has been the primary focus throughout Chapters 1–6. We want to make sure our data accurately reflect the behavior we are
interested in, accurately capture the effect of our intervention, and accurately account for the many sources of multiple control. We know that the more data we collect, the more accurately we can capture what we are interested in. But, unfortunately, no one has unlimited time and resources, and stakeholders often want to know how long we will take to do what we need to do. This chapter reviewed common strategies for determining when we can stop and move on.
Coincidentally, writing chapters in a book involves similar trade-offs. You need to cover enough material that the chapter is accurate and comprehensive enough. But, no single chapter can cover everything and we have to know when to move on. For this chapter, that time is now. In the next chapter, we pivot to an additional variable so common to behavior analysis research and practice that we often forget it is technically an IV. In an exciting twist, however, we have no control over this little monster despite it wreaking havoc on everything we have talked about leading to this point in the book. That little monster is “time.”
References
Baer, D. M., Wolf, M. M., & Risley, T. R. (1968). Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis, 1(1), 91–97. Available from https://doi.org/10.1901/jaba.1968.1-91.
Banna, K. M., & Newland, M. C. (2009). Within-session transitions in choice: A structural and quantitative analysis. Journal of the Experimental Analysis of Behavior, 91(3), 319–335. Available from https://doi.org/10.1901/jeab.2009.91-319.
Branch, M. N., & Pennypacker, H. S. (2013). Generality and generalization of research findings. In G. J. Madden (Ed.), Handbook of behavior analysis (Vol. 1, pp. 151–175). American Psychological Association.
Carlson, H. N., & Weiner, J. L. (2023). The maladaptive alcohol self-administration task: An adapted novel model of alcohol seeking with negative consequences. Journal of the Experimental Analysis of Behavior. Available from https://doi.org/10.1002/jeab.834, ePub ahead of print.
Cook, J. E., Subramaniam, S., Brunson, L. Y., Larson, N. A., Poe, S. G., & St. Peter, C. (2015). Global measures of treatment integrity may mask important errors in discrete-trial training. Behavior Analysis in Practice, 8, 37–47. Available from https://doi.org/10.1007/s40617-014-0039-7.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2019). Applied behavior analysis (3rd ed.). Pearson. ISBN: 978-0134752556.
Cox, D. J. (2016). A brief overview of within-subject experimental design logic for individuals with ASD. Austin Journal of Autism & Related Disabilities, 2(3), 1025.
Cox, D. J., & Brodhead, M. T. (2021). A proof of concept analysis of decision-making with time-series data. The Psychological Record, 71(3), 349–366. Available from https://doi.org/10.1007/s40732-020-00451-w.
Cox, D. J., D’Ambrosio, D., & Rethink First Data Team (2023). An artificial intelligence driven system to predict ASD outcomes in ABA. PsyArXiv. Available from https://doi.org/10.31219/osf.io/3t9zc.
Cumming, W. W., & Schoenfeld, W. N. (1959). Some data on behavior reversibility in a steady state experiment. Journal of the Experimental Analysis of Behavior, 2(1), 87–90. Available from https://doi.org/10.1901/jeab.1959.2-87.
Danov, S. E., & Symons, F. J. (2008). A survey evaluation of the reliability of visual inspection and functional analysis graphs. Behavior Modification, 32(6), 828–839. Available from https://doi.org/10.1177/0145445508318606.
DeMarse, T. B., Killeen, P. R., & Baker, D. (1999). Satiation, capacity, and within-session responding. Journal of the Experimental Analysis of Behavior, 72(3), 407–423. Available from https://doi.org/10.1901/jeab.1999.72-407.
DeProspero, A., & Cohen, S. (1979). Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 12, 573–579. Available from https://doi.org/10.1901/jaba.1979.12-573.
Epstein, L. H., & Dallery, J. (2022). The family of single-case experimental designs. Harvard Data Science Review (Special Issue 3). Available from https://doi.org/10.1162/99608f92.ff9300a8.
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160. Available from https://doi.org/10.3758/BRM.41.4.1149.
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191. Available from https://doi.org/10.3758/BF03193146.
Fisher, W. W., Groff, R. A., & Roane, H. S. (2011). Applied behavior analysis: History, philosophy, and basic methods. In W. W. Fisher, C. C. Piazza, & H. S. Roane (Eds.), Handbook of applied behavior analysis (pp. 3–13). The Guilford Press.
Ford, A. L. B., Rudolph, B. N., Pennington, B., & Byiers, B. J. (2020). An exploration of the interrater agreement of visual analysis with and without context. Journal of Applied Behavior Analysis, 53(1), 572–583. Available from https://doi.org/10.1002/jaba.560.
Guo, P. (2011). One-shot decision theory. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 41(5), 917–926. Available from https://doi.org/10.1109/TSMCA.2010.2093891.
Guo, P., & Li, Y. (2014). Approaches to multistage one-shot decision making. European Journal of Operational Research, 236(2), 612–623. Available from https://doi.org/10.1016/j.ejor.2013.12.038.
Jacobellis v. Ohio. (1964). First Amendment Library. Retrieved from: https://www.thefire.org/supreme-court/jacobellis-v-ohio.
Johnston, J. M., Pennypacker, H. S., & Green, G. (2019). Strategies and tactics of behavioral research and practice (4th ed.). Routledge.
Jones, S. H., St. Peter, C. C., & Ruckle, M. M. (2020). Reporting of demographic variables in the Journal of Applied Behavior Analysis. Journal of Applied Behavior Analysis, 53(3), 1304–1315. Available from https://doi.org/10.1002/jaba.722.
Kahng, S. W., Chung, K. M., Gutshall, K., Pitts, S. C., Kao, J., & Girolami, K. (2010). Consistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 43, 35–45. Available from https://doi.org/10.1901/jaba.2010.43-35.
Killeen, P. R. (1978). Stability criteria. Journal of the Experimental Analysis of Behavior, 29(1), 17–25. Available from https://doi.org/10.1901/jeab.1978.29-17.
Lerman, D. C., Dittlinger, L. H., Fentress, G., & Lanagan, T. (2011). A comparison of methods for collecting data on performance during discrete trial teaching. Behavior Analysis in Practice, 4, 53–62. Available from https://doi.org/10.1007/BF03391775.
Matyas, T. A., & Greenwood, K. M. (1990). Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23(3), 341–351. Available from https://doi.org/10.1901/jaba.1990.23-341.
McSweeney, F. K., & Swindell, S. (1999). Behavioral economics and within-session changes in responding. Journal of the Experimental Analysis of Behavior, 72(3), 355–371. Available from https://doi.org/10.1901/jeab.1999.72-355.
Meany-Daboul, M. G., Roscoe, E. M., Bourret, J. C., & Ahearn, W. H. (2007). A comparison of momentary time sampling and partial-interval recording for evaluating functional relations. Journal of Applied Behavior Analysis, 40(3), 501–514. Available from https://doi.org/10.1901/jaba.2007.40-501.
Merriam-Webster (2021). Statistics. Retrieved from: https://www.merriam-webster.com/dictionary/statistics.
Perone, M. (1991). Experimental design in the analysis of free-operant behavior. In I. H. Iverson, & K. A. Lattal (Eds.), Experimental analysis of behavior, Part 1 (pp. 135–171). Elsevier.
Reilly, M. P., & Lattal, K. A. (2004). Within-session delay-of-reinforcement gradients. Journal of the Experimental Analysis of Behavior, 82(1), 21–35. Available from https://doi.org/10.1901/jeab.2004.82-21.
Roane, H. S., Ringdahl, J. E., Kelley, M. E., & Glover, A. C. (2011). Single-case experimental designs. In W. W. Fisher, C. C. Piazza, & H. S. Roane (Eds.), Handbook of applied behavior analysis (pp. 132–147). The Guilford Press.
Saini, V., Fisher, W. W., & Retzlaff, B. J. (2018). Predictive validity and efficiency of ongoing visual-inspection criteria for interpreting functional analyses. Journal of Applied Behavior Analysis, 51(2), 303–320. Available from https://doi.org/10.1002/jaba.450.
Sidley, N. A., & Schoenfeld, W. N. (1964). Behavior stability and response rates as functions of reinforcement probability on “random ratio” schedules. Journal of the Experimental Analysis of Behavior, 7(3), 281–282. Available from https://doi.org/10.1901/jeab.1964.7-281.
Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Basic Books. ISBN: 978-0962331107.
Tonneau, F., Rios, A., & Cabrera, F. (2006). Measuring resistance to change at the within-session level. Journal of the Experimental Analysis of Behavior, 86(1), 109–121. Available from https://doi.org/10.1901/jeab.2006.74-05.
Walker, S. G., & Carr, J. E. (2021). Generality of findings from single-case designs: It’s not all about the “N.” Behavior Analysis in Practice, 14, 991–995. Available from https://doi.org/10.1007/s40617-020-00547-3.
CHAPTER 8
Wait, you mean the clock is always ticking? The unique challenges time adds to statistically analyzing time series data
A study of economics usually reveals that the best time to buy anything is last year.
Marty Allen
Introduction
Time is a fickle friend. It runs our lives as a dictator via alarm clocks, weekly calendars, birth, and death. It’s always and never on your side. It moves at its own stubbornly consistent pace despite our pleas to slow down or speed up. And, when you try to pin it down for measurement, you can’t move too fast or be too large or your measurement will differ from the measure collected by someone else moving slower or who lives on a more massive planet (e.g., Cohn, 1904; Einstein, 1905). The result is, perhaps, our favorite operational definition of any variable in the sciences: time is what a clock reads (Ivey & Hume, 1974). In short, we have no clue what time really is despite its central role in governing our lives and our sciences.
Time also plays a prominent role in behavior analysis. For example, we analyze the number of responses a client emits per unit of time (rate of responding); we analyze the time over which responding occurs (duration); we analyze how long it takes for someone to respond after presenting a stimulus (latency); we get paid to provide services based on what behaviors someone “should” be able to emit given the time they have spent alive (i.e., their age); and we analyze intervention effects by comparing responses between conditions that each occurred over different swaths of time (within-subject experimental designs). Time is so common, in fact, that at the time of writing, 94% of the experimental articles in the most recent issue of the Journal of Applied
Behavior Analysis (15 of 16 articles in Volume 55, Issue Number 4) had time as one of the variables graphed or as a planned component of their independent variable (IV).
Despite the common presence of time in our analyses, analyzing time series data visually is challenging. If you’re reading this book, you are likely aware of the three fundamental characteristics of all time series graphs: trend, level, and variability[1] (e.g., Cooper et al., 2019). Though visual analysis of these three characteristics is often described simply (e.g., “Just look and see if each changed or not between conditions.”), any new behavior technician or behavior analyst in training will likely tell you that it can be a challenging skill to learn. And, the difficulty of learning this skill can lead to nontrivial disagreement between any two behavior analysts about whether behavior changed or not (e.g., DeProspero & Cohen, 1979; Ford et al., 2020; Kahng et al., 2010; Matyas & Greenwood, 1990) and how to proceed as interventions continue (e.g., Cox & Brodhead, 2021).
As a result of the challenges and sometimes inconsistencies in visually analyzing time series data, smart people throughout time (ha!) have attempted to determine how we can use statistics to describe time series data. But, statistically describing time series data is no easy feat, as time adds a wrinkle to everything we have talked about in the book. Time can easily be converted to a number (Chapter 1). But the distribution (Chapter 2) it takes in our analyses can be uniform (e.g., cumulative record), normally distributed (e.g., latency to respond during reaction time tests), Poisson distributed (e.g., the time it takes for a bus to arrive, for which a client might need to learn to wait), or of another exotic variety. Lastly, we often combine our measure of time with measures of something else (e.g., counts of responses). The result is a primary unit of analysis involving a data type that is jointly determined by two or more data types and the questions we are trying to answer.
Time is also an IV in itself which can impact measures of central tendency (Chapter 3) and variability (Chapter 4). Each of us is always getting older, more mature[2], developmentally advancing, and changing our behavior as we learn and adapt to an ever changing environment around us.
[1] But, wait! There’s more! See Chapter 9 for additional characteristics of time series data seldom discussed in behavior analysis.
[2] Well, some of us, anyway. David and Jason are sure trying their darndest.
This expectation that our behavior is always changing in some capacity impacts measures of central tendency and variability. Measuring central tendency can be challenging because it can depend on how far back in time you look, such as with a trending data path. It can also make measuring variability challenging, as variability might decrease with improved fluency over time, or variability might increase as response or stimulus competition is added to a dynamic environment over time. An important question then becomes: how much of someone’s learning history should be included to accurately describe and predict behavior?
Time also adds a wrinkle to calculating statistical significance, effect sizes, and social significance (Chapter 5). Statistical significance requires that the measures of central tendency and variability come from a single distribution so they can be estimated with a parameter (parametric inferential statistics), or it requires that the data across the conditions being compared are unchanging relative to each other (nonparametric inferential statistics). Effect sizes are differences between two data distributions, which become hard to make claims about if things are constantly changing. And, social significance involves the most recent (or handful of most recent) data points and the assumption that things will only get better or at least maintain.
As a result of the challenges mentioned above, smart people have attempted to develop more advanced methods for precisely and quantitatively describing—in tandem—all three characteristics of time series data. That is the purpose of this chapter: to show you the common methods whereby behavior analysts can use statistics to describe time series data while trying to account for trend, level, and variability. Throughout, we’ll also describe the assumptions needed to use each method and the conditions under which each performs better or worse. Just like tools such as hammers, screwdrivers, and lighters, none of the methods described in this chapter is inherently better or worse than the others; they simply are. By the end of this chapter, you should have a good sense of each of these methods, the specific jobs for which each tool is likely useful, and when the context calls for use of a different tool.
One final note before we get to the meaty stuff: a fair question is why behavior analysts should learn approaches to statistical analysis of time series data if visual analysis has served them fine so far. We think there are at least three good reasons to do so. First, much of the
rest of the world speaks in the language of statistics. The following is something reasonable people will likely disagree about, but we believe that efficient and effective collaboration and communication require us to speak to our audience where they are rather than forcing them to adapt to us. It doesn’t mean we can’t educate others about the benefits of the approaches traditionally taken by behavior analysts. But education takes time. Listeners may only read the title and abstract of your article (if they even know it exists), or they may have 30 seconds to hear and make a decision about the quality and effectiveness of your applied work. Communicating in a language they are familiar with allows them to more easily appreciate the (often massive) effects of our interventions.
Second, the research cited previously suggests that visual analysis is not an easy skill to learn and reasonable people will sometimes disagree about whether behavior changed as a function of our intervention. Yes, we know (because we also cited them above) that researchers have conducted very nice experiments showing that journal editors and behavior analysts with decades of experience will agree in most instances as to whether an intervention effect occurred when the data come from highly controlled experimental settings (Dowdy et al., 2022). But, the majority of the hundreds of thousands of behavior analysts out in the wild are not journal editors, are likely to have been in the field for less than five years (BACB, n.d.), and make decisions daily with data collected in significantly (statistically and socially) less controlled environments. We believe it can’t hurt to have tools that supplement visual analysis and that, at minimum, provide an identical method for everyone, everywhere, regardless of experience and training, to calculate the same result with the same dataset, every single time.
Lastly, behavior analysis is maturing. For example, compare Skinner’s simple single-schedule experiments with rats in operant chambers to any recent article on concurrent chains to assess intervention preference. In the words of a wise colleague who shall remain unnamed, “The low hanging fruit has likely already been picked.” Translation: the simple research questions, experiments, and interventions that cause large changes in behavior have largely been conducted. This doesn’t mean that past researchers had it easy. We suspect it was quite difficult to start an entirely new branch of science. But, as we
move forward, this observation does suggest that we will need to be increasingly smarter and more skilled at manipulating the environment and observing smaller changes in behavior. A shift toward nuance and complexity occurs as every science matures (e.g., the Large Hadron Collider). At some point, it’s possible that the degree of level, trend, or variability observed becomes small enough, or involves a complex array of so many IVs, that a complete visual analysis becomes nearly impossible. We already saw a bit of this in Chapter 6. Statistics help us parse such complexity and nuance and—again—in a manner that anyone, anywhere, regardless of background and skill, can replicate with any dataset at any point in time. This seems like it has face utility[3].
Statistical analysis of time series data for single-case designs
To remind the reader of our foundational definition: statistics is a branch of mathematics whose topic is the collection, analysis, interpretation, and presentation of aggregate quantitative data (Merriam-Webster, 2021). As you might imagine, there are many ways that we might quantitatively aggregate, analyze, and interpret data that have all the fun characteristics of time series data. Historically, behavior analysts have seemingly taken one of at least five approaches when attempting to statistically analyze time series data. These include variations on: structured criteria, overlap statistics, effect sizes, regression modeling, and “nested” modeling approaches (e.g., multilevel modeling, hierarchical modeling). In the following sections, we review the basic idea, assumptions, and questions that each approach takes. As with everything else in this book, a full and critical treatment of all methods in this space would likely require a book in itself. As such, our goal here is to highlight the various approaches that can be used, the main tools, and the conditions under which each tool is likely best suited.
Structured criteria
Structured criteria approaches are a bit more general and sometimes enmeshed with some of what follows, so we’ll start with this approach. Structured criteria approaches involve exactly what the name implies.
[3] Is this even a thing? If not, we think it should be.
A set of criteria or tools to supplement and guide visual analysis of time series data to improve agreement about behavior-environment relations (e.g., whether the intervention changed behavior, or identifying the function of behavior; Dowdy et al., 2022). If you have been following the book thus far, you might be thinking, “Don’t most statistics involve following a set of rules for analyzing data and predetermined criteria the data must meet to make a claim?” Yes. Yes, they do. The reason the approaches in this section have the name they do is because they were proposed by behavior analysts for behavior analysts as structured criteria to supplement visual analysis. Before their arrival, visual analysis of data by behavior analysts was often accomplished without explicit criteria or methods[4]. So, these are a much more structured approach with explicit rules around clinical decision-making compared to how visual analysis had been conducted prior and by most people. Most structured criteria approaches follow similar overall guidelines[5] and we’re covering these in a statistics book for two reasons. First, many of the approaches derive statistics to identify cutoff points to which criteria are applied. Second, structured criteria approaches have many benefits but also some drawbacks, which the methods after this section attempt to solve. Thus it can be helpful to know generally the conditions under which structured criteria approaches perform well and the conditions under which alternative approaches might be more useful. So, let’s get into it.
Statistics typically are used to (1) create the criterion lines (CLs), (2) analyze data against the CLs, and (3) make a claim that an intervention effect exists or to identify the function of a behavior. One early example of structured criteria was published by Hagopian et al. (1997) and further refined by Roane et al. (2013). When following this approach, the behavior analyst uses statistics to add several lines as a visual supplement to analyze data obtained during functional analyses (top panels, Fig. 8.1).
[4] We have lied to you here. Technically, some researchers used statistics to analyze time series data from individual subjects very early on and even published them in the early volumes of the Journal of the Experimental Analysis of Behavior (e.g., Boren & Navarro, 1959; Pierrel, 1958). But, likely due to Skinner’s strong opinion opposing inferential statistics (e.g., Skinner, 1938), these approaches were certainly not in the mainstream and we suspect they were ignored by most who proudly called themselves behavior analysts.
[5] We’re just reviewing the basics here. For a comprehensive review, check out Dowdy et al. (2022).
Figure 8.1 Conditions under which criterion methods to supplement visual analysis of time series might be useful (left panels) or misleading (right panels). Top panels show the use of the arithmetic mean and criterion lines ± 1 standard deviation. Bottom panels show the use of the conservative dual criteria (CDC) method. Note in the right panels how trends and nonlinear data patterns can impact the criteria lines. As a result, the interpretation of function (top right panel) or the presence of an intervention effect (bottom right panel) is impacted even though—visually—the pattern appears a bit more clear.
As you likely know, when analyzing data from a functional analysis, behavior in each test condition is compared to data in the control condition. Thus CLs are drawn that statistically describe the control condition. The first line is drawn horizontally at the arithmetic mean of responding in the control condition. The upper CL is drawn one standard deviation above the arithmetic mean and the lower CL is drawn one standard deviation below the arithmetic mean[6]. Once the lines are drawn, for each functional analysis condition you count the number of points above the upper CL and below the lower CL[7]. Next, you subtract the count below from the count above the CL.
[6] Statistics!
[7] Of note, when a CL equals zero, most past publications count the data at zero as “below” the CL.
Once calculated, you follow two rules to check for trends in the data and differentiation between any given condition and the control; three rules around what to do when one or more conditions are differentiated from the control; and four rules around what to do when no condition is differentiated from the control. The result is a set of claims as to whether each test condition differs from the control condition.
Another example of a structured criteria approach to supplement visual analysis uses the split middle method published in several studies (Bailey, 1984; Kazdin, 1982; Parsonson & Baer, 1986; White, 1974) and further refined by Fisher et al. (2003) into the dual criteria (DC) and conservative dual criteria (CDC) methods. For the unfamiliar, the split middle method is a manual method for estimating a linear regression line for data plotted in x, y coordinates. The idea, then, is to use the split middle method to derive a linear regression of the baseline data (or to let Excel do the hard work for you). This predicted line is then superimposed on the treatment phase and, if using the DC method from Fisher et al. (2003), a line representing the arithmetic mean of baseline is also superimposed on the treatment phase (bottom panels, Fig. 8.1). The CDC method involves raising (or lowering for reduction targets) the two CLs by 0.25 standard deviations of the baseline data. To claim differentiation between baseline and the subsequent treatment phase, you count the number of treatment points falling above (or below) the lines superimposed on the treatment phase. Lastly, you use the binomial equation (Chapter 5) to calculate the probability that the number of data points falling above (or below) the lines occurred by chance.
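To make the counting concrete, here is a minimal sketch of the CDC logic under a few simplifying assumptions: an ordinary least-squares line stands in for the split middle method, the target is a behavior increase, and the full published decision rules are reduced to a single binomial check. The data are hypothetical.

```python
# Minimal CDC sketch (OLS in place of split middle; increase target; toy data).
import numpy as np
from scipy import stats

def cdc_check(baseline, treatment, alpha=0.05):
    baseline = np.asarray(baseline, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    # Baseline trend line projected into the treatment phase
    x = np.arange(len(baseline))
    slope, intercept = np.polyfit(x, baseline, 1)
    t_x = np.arange(len(baseline), len(baseline) + len(treatment))
    trend_cl = intercept + slope * t_x
    mean_cl = np.full(len(treatment), baseline.mean())
    # CDC: raise both criterion lines by 0.25 baseline standard deviations
    shift = 0.25 * baseline.std(ddof=1)
    k = int(np.sum((treatment > trend_cl + shift) & (treatment > mean_cl + shift)))
    n = len(treatment)
    p = stats.binom.sf(k - 1, n, 0.5)  # P(k or more points above both lines by chance)
    return k, n, p, p < alpha

print(cdc_check([2, 3, 2, 4, 3], [6, 7, 9, 8, 10, 9]))
```

With real data you would apply the complete rule set from Fisher et al. (2003) rather than this single check.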
Benefits of structured criteria approaches
There are two primary benefits to using structured criteria approaches. First, the math is relatively easy to do, the visual supplements can be added using pen and paper or with a few clicks in Excel, and the end result involves visual analysis and simple counting, which is already familiar to behavior analysts. This “ease of use” is definitely an advantage. Second, structured criteria approaches were designed in response to past research suggesting that agreement between any two visual analysts is often (unacceptably?) low (e.g., DeProspero & Cohen, 1979; Ford et al., 2020; Kahng et al., 2010; Matyas & Greenwood, 1990). Past researchers have found that structured criteria approaches improve agreement substantially and reduce the emission of Type I errors by those who use these approaches (Dowdy et al., 2022; Fisher et al., 2003). Thus, considering the functions for which these approaches were developed, they appear to fulfill them well.
Limitations that time imposes on structured criteria approaches
Despite their benefits, there are some conditions under which the structured criteria approach might be challenging to justify. Most of these stem from the assumptions that need to be made in order for the criterion lines to be created. Of note, the theme here, as elsewhere in the book, is that the type and distribution of your data should determine whether these approaches are or are not appropriate to answer the question that you have. Nothing in life is inherently good or evil (except double fudge brownies—those are always good).
A first important assumption that has to be met when using structured criteria approaches stems from using the arithmetic mean, standard deviation, or a line of best fit to your data. As discussed in Chapters 2–4 and 6, the arithmetic mean and standard deviation are appropriate descriptions of your data only if those data are close to being normally distributed[8]. Relatedly, because the overall distribution changes over time for trending data, estimating the arithmetic mean and standard deviation using all data would lead to an inaccurate description of trending data within a condition. There are easy workarounds to both of these, such as switching from the arithmetic mean to a more appropriate measure of central tendency (Chapter 3) and switching from the standard deviation to a more appropriate measure of variability (Chapter 4). And, for trending data, one could use only responding once it is stable toward the end of a condition (Journal of the Experimental Analysis of Behavior, anyone?). But we are unaware of anyone who has tested these approaches empirically. And, these approaches would also require another criterion for determining which data to use and which to discard. All-in-all, it is unknown how such changes would impact the benefits gained from using structured criteria approaches.
[8] A related assumption when using the normal distribution is that the data are independently and identically distributed (iid) randomly within a probability distribution that best describes the data. Because behavior at one time point is necessarily influenced by what happened at previous time points (e.g., autocorrelation), this assumption rarely holds. We’re including this as a footnote, however, as this assumption is more of a bothersome quirk for inferential statistics, not necessarily for the “descriptive”-esque use of the data with structured criteria. Stated differently, the iid assumption seems more academically relevant than practically relevant. But you should know about it in case someone asks.
The linear regression approach of DC and CDC helps resolve some of the challenges above for trending levels. However, linear regression
via the split middle method or Excel’s easy point-and-click method is technically correct only under certain conditions. First, linear regression assumes the relation between time and behavior is, well, linear (Chapters 5, 6). But, behavior may not change linearly with time. For example, percentages are constrained between 0% and 100%, and rates or counts of responding are constrained between zero and the maximum amount of responding possible during an observation. Thus acquisition and reduction curves are often nonlinear (e.g., acquisition often follows an increasing curve that decelerates as it approaches a ceiling; reduction often follows a decreasing curve that flattens as it approaches a floor). Depending on the data trend during baseline, linear regression models might make predictions when superimposed onto the treatment condition that are outside of what is logically possible (e.g., baseline reversal in the lower right panel of Fig. 8.1).
A second challenge to using the structured criterion methods in this section is that they allow the behavior analyst to analyze only differences in responding between conditions (i.e., a univariate analysis of condition; Chapter 6). In the previously mentioned published literature, this was all that the behavior analysts were after. Thus it worked well enough for the task at hand. Behavior analysts interested in examining the effect of more than one variable on an individual’s patterns of responding would be unable to use only these methods to answer their research or practice questions.
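When that bounded, nonlinear shape matters, one option is to fit a bounded curve rather than a straight line. The sketch below fits a logistic function with scipy; the session numbers and percentages are hypothetical values invented for illustration.

```python
# Sketch: fitting a bounded (logistic) acquisition curve instead of a line.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, floor, ceiling, rate, midpoint):
    """S-shaped curve constrained between a floor and a ceiling."""
    return floor + (ceiling - floor) / (1 + np.exp(-rate * (t - midpoint)))

sessions = np.arange(1, 13)
pct_correct = np.array([5, 8, 14, 22, 38, 55, 70, 81, 88, 92, 94, 95])  # hypothetical

params, _ = curve_fit(logistic, sessions, pct_correct, p0=[0, 100, 1.0, 6.0])
# Unlike a linear fit, predictions from this curve never leave the
# floor-to-ceiling range, so extrapolations stay logically possible.
```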
(Non)Overlap statistics
One way to eschew all the assumptions around statistical descriptions of data inherent to structured criteria approaches is to avoid using the parameter-based descriptions of your data that rely on an underlying probability distribution. We saw some of this way of thinking with the nonparametric alternatives to inferential statistics in Chapter 5. One method of doing this is to simply examine how much data from one condition overlap with the data from another condition. These are aptly termed overlap statistics (e.g., Costello et al., 2022; Dowdy et al., 2021; Kazdin, 1978; Scruggs et al., 1987).
Overlap statistics are exactly what the name implies. That is, when comparing data across conditions (e.g., baseline to intervention), you calculate the percentage of data from the different conditions that do or do not overlap in absolute numeric value. This can be
accomplished in a variety of ways such as the percentage of nonoverlapping data (PND; Kazdin, 1978; Scruggs et al., 1987), nonoverlap of all pairs (NAP; Parker & Vannest, 2009), percentage of all nonoverlapping data (PAND; Parker et al., 2007), and the percentage of data points exceeding the median (PEM; Ma, 2006). Many recent publications have provided wonderful reviews and critiques of how to calculate each of these statistics, their benefits, and their drawbacks (e.g., Costello et al., 2022; Dowdy et al., 2021). You are encouraged to check out those articles for deeper dives. But, here is a surface-level skip across the pond.
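As a flavor of how simple these calculations are, here is a brief sketch of two of them for an increase target; the baseline and intervention arrays are hypothetical.

```python
# Sketch: two common (non)overlap statistics for an increase target (toy data).
import numpy as np

def pnd(baseline, treatment):
    """Percentage of treatment points exceeding the highest baseline point."""
    return 100.0 * np.mean(np.asarray(treatment) > np.max(baseline))

def nap(baseline, treatment):
    """Nonoverlap of all pairs: chance a treatment point beats a baseline point."""
    a = np.asarray(baseline)[:, None]
    b = np.asarray(treatment)[None, :]
    wins = (b > a).sum() + 0.5 * (b == a).sum()  # ties count as half
    return wins / (a.size * b.size)

baseline = [2, 4, 3, 5, 4]
treatment = [6, 5, 8, 9, 7, 10]
print(pnd(baseline, treatment))  # percent of nonoverlapping treatment points
print(nap(baseline, treatment))  # 0.5 = chance; 1.0 = complete nonoverlap
```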
Benefits of (non)overlap statistics
One of the main benefits of (non)overlap statistics is that they do not require any assumptions about the underlying distribution of the data. The data in baseline or intervention conditions can be distributed normally, uniformly, logarithmically, Poisson-ally, bimodally, or any variant in between. We can calculate (non)overlap statistics without regard to the data distributions because they do not rely on a single point estimate to describe the level and variability of the conditions. After all, this is one of the main problems these statistics were designed to solve. In a more jargon-filled statement of this benefit, (non)overlap statistics use more information from the raw data and are less likely to be biased compared to quantitative analyses relying on point estimates.
A second main benefit is their ease of calculation. Behavior analysts the world over are familiar with and likely make heavy use of percentages in their daily research and practice. Conceptually, these statistics are easy to grasp, can be calculated on the back of a napkin, or Excel can be programmed to run the numbers for you automatically using simple equations. This ease of understanding and calculation can be useful for behavior analysts with a less robust quantitative background, as well as when communicating about intervention success with audiences who have a less robust quantitative background. This benefit cannot be overstated.
Limitations that time imposes on (non)overlap statistics
This is a chapter about time and its unique impact on quantitative descriptions of behavior-environment relations. Time muddies the water for (non)overlap statistics. For example, all of the (non)overlap statistics above completely ignore any trends in the data for the
baseline or intervention conditions. Here, trends might occur in one of at least two ways: trending levels or trending variability within a condition. Fig. 8.2 shows the conditions under which this might be a problem. (Non)overlap statistics are likely inappropriate to use if the level of your baseline data is trending in the direction of a treatment effect (top panel, Fig. 8.2), the level of your treatment data is trending toward baseline levels, or the variability in your data in either condition systematically increases (bottom panel, Fig. 8.2) or decreases. With either trending levels or trending variability in your data, (non)overlap statistics can be a misleading description of either the effectiveness of an intervention or what people might expect in the future.
A second limitation that time imposes on (non)overlap statistics is that they do not—by themselves—allow you to identify the influence of more than one variable on behavior change. (Non)overlap statistics simply capture the difference in absolute values between two or more conditions irrespective of their distribution. You could work around this, technically, by using a (non)overlap statistic with other approaches to modeling your data (e.g., regression modeling,
Figure 8.2 Conditions under which (non)overlap statistics might be misleading because of trending level (top panel) or trending variability (bottom panel). In both instances, continuing to collect data in baseline likely would lead to overlap between collected data in baseline and intervention conditions.
classification modeling; Chapter 6). For example, you could use the (non)overlap statistic as the primary dependent variable (DV) across conditions to determine the extent to which other variables played an influential role in behavior change (e.g., therapist, reinforcement schedules, prompting procedures). As another example, you could use the (non)overlap statistic as an IV to predict its influence on stress reduction and quality-of-life improvement ratings for the client or their family members. But, by themselves, (non)overlap statistics are a relatively rough and simple estimate of the difference in behavior between two conditions.
Effect size measures
Chapter 5 discussed many of the commonly used effect size measures in behavior analysis such as standardized mean/median change from baseline (e.g., Cohen’s d, Glass’s Δ, Hedges’ g), strength of association indices, and so on. As a reminder, the function of effect size measures is to more precisely quantify the observed size of behavior change between two conditions, typically baseline and an intervention. Effect size measures are gaining traction in most scientific arenas as a more useful alternative to traditional null hypothesis significance testing (NHST). This is likely for two reasons. First, the faulty logic of NHST makes it challenging to interpret the p value for whether or not the intervention changed behavior. Second, we often don’t care whether an intervention had an effect, but how big of an effect we observed and how quickly we got to that intervention effect. Effect sizes answer the former, more practical question. Thus, wherever possible, behavior analysts are encouraged to use effect sizes to quantitatively describe the effect on behavior following a change in environmental conditions.
One effect size that was not explicitly mentioned in Chapter 5, and that you should know about, is Tau-U (Brossart et al., 2018). Why, you ask? Well, the effect sizes mentioned in Chapter 5 are typically used with static datasets wherein you are not worrying about changes in level or trend within a group or one of the comparator datasets. Time series data used for single-case experimental designs do, however, often have changes in level or trend within conditions. Tau-U (τ-U) is a coefficient related to Kendall’s tau (τ) that accounts for changing trend and level within and between conditions for time series data. Thus Tau-U is a great effect size alternative if you have two time series
conditions where baseline or intervention levels or variability are changing.
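To ground the idea, here is a hedged sketch of one common Tau-U formulation (the cross-phase comparison corrected for baseline trend). Published packages differ in the exact corrections they apply, so treat this as illustrative rather than definitive.

```python
# Sketch: one common Tau-U formulation (A vs. B minus baseline trend).
import numpy as np

def tau_u(baseline, treatment):
    a = np.asarray(baseline, dtype=float)
    b = np.asarray(treatment, dtype=float)
    # Cross-phase pairs: +1 when the treatment point improves on the baseline point
    s_cross = np.sign(b[None, :] - a[:, None]).sum()
    # Baseline trend pairs (i < j): credit for improvement already underway
    s_trend = sum(np.sign(a[j] - a[i])
                  for i in range(len(a)) for j in range(i + 1, len(a)))
    return (s_cross - s_trend) / (a.size * b.size)

print(tau_u([2, 3, 4, 5], [6, 7, 8, 9]))  # an improving baseline lowers the score
```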
Benefits of effect size measures
One of the primary benefits of effect sizes is one of the reasons they are gaining traction: effect sizes tell us how big of a change in behavior researchers or practitioners have observed in the past following implementation of an intervention. This is extremely useful for a few reasons. First, effect sizes allow us to make an educated guess as to how big of an impact an intervention might have with our clients or participants who are similar to those who have participated in past published work that contains effect sizes. Second, (most) effect sizes are calculated independent of sample size. Thus simply increasing the total number of your observations (Chapter 7) is unlikely to impact your measure of effect size other than to move the estimate toward its central tendency. This is quite different from NHST statistical tests, wherein increasing your sample size makes the exact same effect look better and better.
Limitations that time imposes on effect size measures

Time adds three primary challenges to calculating effect sizes. Two of these we have already discussed: trending (i.e., changing) levels and trending variability over time. Effect sizes that ignore data trends (e.g., Tau, mean/median change) quickly become inaccurate descriptions for the same reasons mentioned above. If baseline is trending toward an intervention effect or intervention is trending toward baseline, then any quantified size of effect is likely to be washed out if we simply conduct more sessions. Further, effect size measures that attempt to include or control for trends (e.g., Tau-U) typically use a nonparametric rank correlation. This is certainly an improvement over those that do not include trends. However, these approaches often fail to account for nonlinear patterns in baseline or intervention conditions. Thus, when your data are nonlinear within a condition, you'll likely want to either identify a justifiable period of time wherein the "final effect" of each condition on responding is accurately captured, or you'll want to pivot to the regression modeling described in more detail below.

Effect sizes that ignore systematically changing variability over time and that use absolute counts of behavior (e.g., standardized mean/median change) also should raise your spidey senses. The reason is that the effect size you calculate might underestimate the predicted
effect (if variability is reducing) or overestimate the predicted effect (if variability is increasing). In itself, this may or may not be a big deal depending on the audience with whom you are communicating the effect size. But, such trending variability will reduce the accuracy of the calculated effect size. Similar to trending levels of data, the "solution" is often to identify a justifiable period of time wherein the "final effect" of each condition on behavior is accurately captured; or you'll want to pivot to regression modeling.

The third challenge that time adds to calculating effect sizes goes back to the second practical question we mentioned above. That is, we often want to know not just how big of an effect we can expect, but how quickly we can get there. At the time of this writing, no known effect sizes account for both size and speed. This is an active area of research, however, and as soon as this hits the press, we suspect someone somewhere will have solved this problem. Whenever this happens, we are excited to work that solution into our milieu. Gaining more accurate predictions around how long an intervention should last is of significant social value for clients, behavior analysts, funders, and researchers seeking to improve intervention efficiency.
Regression and classification modeling

Regression and classification models were discussed at length in Chapter 6 and so won't be reviewed in depth here. But, to quickly recap, both regression and classification models seek to mathematically relate one or more IVs (e.g., baseline vs. intervention; home vs. school setting; reinforcement schedule) to a DV (e.g., responses per minute; percentage of trials with correct responding). The goal often is to precisely describe and predict how much or whether behavior will occur as a function of the IV(s) included in the model. With regression modeling, we try to relate IVs to a continuous DV (or perhaps ordinal if there are enough discrete levels that the resulting distribution is "close enough" to continuous). With classification modeling, we try to relate IVs to a categorical or nominal DV.

As discussed above, researchers have used linear regression to model the relationship between behavior change and time (e.g., Bailey, 1984; Fisher et al., 2003; Kazdin, 1982; Parsonson & Baer, 1986; White, 1974). Typically the goal is to best describe the trend in
changing level of behavior during one condition (e.g., baseline) so as to superimpose predicted behavior during a second condition (e.g., intervention). The assumption is that the superimposed line (and its related statistical properties) is a best guess prediction as to what behavior would have looked like had no environmental manipulation taken place (Fig. 8.3). The researcher can then determine the extent to which patterns of behavior predicted as extending from the first condition are similar to patterns of behavior observed during the second condition.
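A bare-bones version of that logic, with fabricated numbers, might look like the following sketch: fit a line to the baseline sessions, project it across the intervention sessions, and treat the projection as the no-intervention counterfactual.

```python
# Hedged sketch of the Fig. 8.3 logic: fit baseline, extrapolate, compare.
# Numbers are fabricated for illustration.
import numpy as np

baseline = np.array([12.0, 11.0, 13.0, 12.0, 14.0])  # responses/min, sessions 1-5
intervention = np.array([9.0, 8.0, 7.0, 6.0, 5.0])   # sessions 6-10

t_base = np.arange(len(baseline))
t_intv = np.arange(len(baseline), len(baseline) + len(intervention))

slope, intercept = np.polyfit(t_base, baseline, deg=1)  # baseline trend line
counterfactual = slope * t_intv + intercept             # predicted "no change"
print(counterfactual - intervention)                    # per-session difference
```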
Benefits of modeling approaches

There are several benefits to using regression (or classification) modeling approaches to describe and make predictions about time series data. First, many of the above approaches failed to consider trending levels of behavior change within and between conditions. Regression models can handle linear and nonlinear trends in level quite easily such that predictions of behavior for comparison across conditions are quite straightforward (Fig. 8.3). The steps would likely involve something
Figure 8.3 Examples of linear (top panels) and nonlinear (bottom panels) regression modeling of time series data. In the top panels, linear regression predictions from one phase are extended into the relevant intervention effect and replication phases. In the bottom panels, the nonlinear regression model is drawn over the original phase being modeled to make the nonlinear pattern easier to see. Left panels model the level of behavior in each condition. Right panels model the level of variability in each condition. In all panels, error bars represent 95% confidence intervals.
along the lines of creating a model to describe behavior in the first condition (e.g., baseline) as a function of one or more IVs. If the model was good enough (see Chapter 6), the behavior analyst could use the derived model to predict what behavior would look like during the span of time encapsulated by the second condition (e.g., during intervention). Lastly, the behavior analyst could evaluate whether the observed behavior during the second condition differed significantly (statistically or socially; Chapter 5) and the size of that difference (effect sizes; Chapter 5)9.

A second benefit to modeling time series data is that trending variability can be examined and accounted for directly and quantitatively (right panels, Fig. 8.3). Here, the behavior analyst would graph the variability from datum to datum and across conditions. Once variability is plotted in isolation from level, the behavior analyst could use the same modeling approaches described above to describe and predict how variability might trend over time and whether meaningful differences in variability exist across conditions (statistically or socially), as well as the size of that difference (effect sizes). At the time of writing, we are unaware of researchers or practitioners who consistently use this approach to quantitatively analyze variability directly. Nevertheless, it is a logical extension of existing approaches for those wishing to do so. And, for those familiar with the utility of calculating derivatives and dynamics in other sciences, this likely is a familiar analytic method.

A final benefit to using modeling approaches to analyze time series data is that it is relatively straightforward to include more than one IV in the analysis. Chapter 6 described in detail much of how this works. We note this benefit here as the ability to include multiple variables within the analysis of time series data was not as direct in the approaches mentioned above. Specifically, structured criteria, (non)overlap statistics, and effect sizes lead to a quantitative value that could be used in a second step as the DV to analyze the relative influence of multiple IVs on that DV. Modeling approaches allow you to do that in a single step via model building. This is extremely useful as
9 NB: In many situations, this approach would be similar to the structured criteria approaches. Similarities and differences would likely extend from (1) how the model is developed, (2) the ability to include multiple variables and nonlinear relationships, (3) explicit testing and description of model fit, and (4) how estimates are derived to quantify differences between the observed and predicted data in the second condition.
you can compare the effect of your intervention along with other variables directly and at the same time.
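As a hedged sketch of that single step, a formula-style regression lets you enter the phase variable and any other candidate IVs in one model. The column names and values below are invented for illustration.

```python
# One model, several IVs at once (hypothetical data and column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "rate":    [3, 4, 3, 9, 10, 11, 4, 3, 4, 12, 11, 13],
    "phase":   ["base", "base", "base", "intv", "intv", "intv"] * 2,
    "setting": ["home"] * 6 + ["school"] * 6,
})
fit = smf.ols("rate ~ phase + setting", data=df).fit()
print(fit.params)  # estimated effects of intervention and setting, together
```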
Limitations that time imposes on modeling approaches

Time also adds a bit of a wrinkle to using simple regression or classification models to describe and predict time series data. Most of these wrinkles stem from the set of assumptions required for regression models to be statistically robust. These include that the IVs and DVs are normally distributed, there are no outliers, the IVs are not systematically associated with each other (i.e., uncorrelated), and that the residuals are not autocorrelated (i.e., error at time t does not depend on errors at time t−1). For linear regression models, specifically, additional assumptions include that the arithmetic mean of the distribution of errors is zero, the variance of errors across all levels of the IVs is constant (i.e., homoscedasticity), and the errors are normally distributed. With time series data, autocorrelation of the residuals is typically the rule, rather than the exception. There are ways to handle situations where these assumptions are violated (see Chapters 6 and 9). And, rarely in applied situations can statistical assumptions be perfectly met10. Nevertheless, modeling time series data does require a bit more analytic skill than the approaches mentioned above if one wants to use it as accurately as possible based on statistical theory.

10 Google the allegory of the spherical cow for one practical reason why it is okay if all statistical assumptions are not perfectly met.
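Screening for one of those violations is at least easy. The sketch below fits an ordinary trend line to simulated session data and then computes the Durbin-Watson statistic on the residuals; values near 2 suggest little lag-1 autocorrelation, and values well below 2 flag the positive autocorrelation that is so common in time series data.

```python
# Hedged sketch: fit an OLS trend to simulated session data, then screen the
# residuals for lag-1 autocorrelation with the Durbin-Watson statistic.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)
sessions = np.arange(30.0)
rate = 5 + 0.3 * sessions + rng.normal(0, 1, 30)  # simulated responses/min

fit = sm.OLS(rate, sm.add_constant(sessions)).fit()
print(durbin_watson(fit.resid))  # ~2 here; real time series often sit below 2
```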
Nested approaches to modeling

Modeling approaches that handle nested or hierarchical data are another form of modeling time series data that has gained traction recently in published behavior analytic research. Nested or hierarchical data simply means data where you have information from multiple levels or hierarchies such that they are related in some way. For example, you may have data on the effect of the good behavior game (e.g., Barrish et al., 1969; Peltier et al., 2023) for 100 different students from five different classrooms. Here, students are "nested" within classrooms such that the data labels for each classroom are not independent of the students in that classroom. As another example, you may have skill acquisition data for a single client spanning 100 sessions, 20 programs, and the settings of clinic, home, and school. Here, trial data are
nested within programs, which are nested within sessions, which are nested within intervention settings. Models that handle such nested data often come from a family of analyses known as mixed effects modeling (e.g., multilevel modeling; hierarchical linear modeling).
Benefits of nested modeling approaches

The primary benefit of mixed effects models is that they can handle data where the observations are not independent of each other (Garson, 2013). For example, in the modeling approaches described in the previous section, many of the assumptions are often violated in time series data of human behavior—especially around independence of observations and residuals. This can lead to the predictor variables being misinterpreted in magnitude and sometimes even direction (Garson, 2013). For examples already within the behavior analytic corpora, multilevel modeling has been used to analyze indifference data in discounting (e.g., Young, 2017), to model cigarette purchase task data (e.g., Zhao et al., 2016), and to model reinforcer preference following task completion in children (e.g., DeHart & Kaplan, 2019)11. Lastly, just as with the modeling approaches discussed above, mixed effects models allow the behavior analyst to describe the influence of multiple IVs on behavior (i.e., multivariable mixed effects models) and can describe linear and nonlinear relationships (e.g., Bolker, 2008; Lindstrom & Bates, 1990). Given the many benefits of mixed effects models for time series data, readers seriously interested in quantitative analyses of time series data are strongly encouraged to learn how to use this approach in their work.
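To make the idea tangible, here is a minimal mixed effects sketch with fabricated good-behavior-game-style data: a fixed effect of phase plus a random intercept for each classroom to respect the nesting. Real analyses would involve model comparison and diagnostics, and tiny toy datasets like this one may trigger convergence warnings.

```python
# Minimal mixed effects sketch (fabricated data): fixed effect of phase,
# random intercept per classroom to respect the nesting.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "disruptions": [12, 10, 4, 3, 15, 14, 6, 5, 9, 8, 2, 1],
    "phase":       ["base", "base", "game", "game"] * 3,
    "classroom":   ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})
model = smf.mixedlm("disruptions ~ phase", df, groups=df["classroom"]).fit()
print(model.summary())  # fixed effect of the game, plus classroom variance
```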
Limitations of nested modeling approaches

As with any description and prediction about behavior, the one doing the describing and predicting has to make some assumptions. The same holds true for mixed effects models. You may have noticed we changed this subheader slightly from the previous sections as the limitations are less about time, per se. Some of this starts to get a bit into the weeds on exactly how your favorite software program conducts these modeling analyses under the hood. But, because this is a book about exposure and gentle introductions, here goes nothing.
11 NB: This article also offers a fantastic single-article overview and tutorial for how to do this kind of work. For those who read this article and find themselves ready to bathe fully in these crystal blue waters, we recommend Garson (2013).
One of the primary assumptions of mixed effects models is that observations within a cluster are positively correlated (Nielsen et al., 2021). But, this may not always be the case. For example, past researchers have observed negative intraclass correlations in zero-sum situations where the behavior of one individual is dependent on the behavior of another (e.g., finite resource distribution amongst members of a group; Kenny et al., 2002; Pryseley et al., 2011). And, sometimes intraclass correlations may be small. In both situations, mixed effects modeling may lead to less accurate results and inflated Type I errors (i.e., false positives; Baldwin et al., 2008; Nielsen et al., 2021).

The second limitation to the use of mixed effects models is that many implementations assume that any given IV has either a fixed effect or a random effect on the DV (Baltagi, 2008). Fixed effects refer to an assumption that the IV has the same effect on the DV across all observations. For example, the contingencies associated with an intervention have the same effect on responding throughout the entire intervention condition. Other examples of fixed effects might be age, cultural background, and intervention setting. Random effects refer to an assumption that the IV has a fixed relationship with the DV, but that the effect of the IV on the DV will vary randomly from one observation to the next. For example, maybe the intervention setting of school is consistently associated with greater responding compared to home, but it—randomly—will influence responding differently from day to day in some small way. As you might imagine, a purely fixed effect and a consistent-but-random effect are not the only two possibilities. There are some situations where the effect of an IV on a DV might increase or decrease over time, such as when learning takes place across a condition or with satiation. Here, dynamical models might be a better choice.

Lastly, your intuition was not wrong if you felt that things were starting to get complicated as we moved into mixed effects models. A final limitation to be aware of is that mixed effects models are much more computationally demanding than the other approaches mentioned above (Russell, 2022). It is true that computers do this hard work for us and that the datasets behavior analysts play with are likely to be smaller compared to those of other disciplines. But, as the number of observations and variables in a model grows, the likelihood increases that the models fail to converge; and it is well known that
applying random effects to slopes increases time to converge compared to applying random effects to intercepts (Russell, 2022). Given the importance of slopes (i.e., changing levels of behavior over time) to the time series data that behavior analysts play with, this is a nontrivial consideration. To compound this, it is often best practice to evaluate mixed effects models relative to simpler models (e.g., models fit using ordinary least squares and with or without random effects; Russell, 2022). All of this, thus, requires increased time, computational resources, and statistical acumen above and beyond what many of the other approaches require.
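For the curious, the intercept-versus-slope distinction above is a one-argument change in many implementations. The sketch below (fabricated data again) fits both versions; the second model is the kind where fitting slows down and convergence warnings start appearing as datasets grow.

```python
# Random intercepts vs. random slopes (fabricated data). The re_formula line
# is the change that tends to slow or break convergence as models grow.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "session": np.tile(np.arange(10.0), 4),
    "client":  np.repeat(list("ABCD"), 10),
})
df["rate"] = 3 + 0.5 * df["session"] + rng.normal(0, 1, len(df))

intercepts = smf.mixedlm("rate ~ session", df, groups=df["client"]).fit()
slopes = smf.mixedlm("rate ~ session", df, groups=df["client"],
                     re_formula="~session").fit()
print(intercepts.params, slopes.params, sep="\n")
```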
Chapter summary

Though exceptions are increasingly common, behavior analysts primarily analyze repeated observations of some metric of behavior (e.g., response rate, duration, percentage of trials) as a function of some metric of time (e.g., session). Behavior analysts most commonly use visual analysis to analyze time series data given the ease with which it can be conducted (you just need to look at the graph); its conservative nature in identifying an effect; and the ability to simultaneously analyze level, trend, and variability within and across conditions. Nevertheless, a significant limitation of visual analysis is its imprecise nature, which can lead to inconsistent claims of behavior change among visual analysts. In most sciences, the ability to replicate experimental and analytic methods is considered critical for making claims about how the universe works. Thus, compared to statistical techniques, it is a nontrivial problem that two behavior analysts employing visual analysis will not always, everywhere, and repeatedly arrive at the same decision with the same dataset.

The possibility for making inconsistent claims about behavior change via visual analysis has led researchers to develop analytic methods that lead to consistent and more precise descriptions of time series data. This chapter reviewed at a high level some of the common structured approaches to analyzing time series data that allow for direct replication to produce the same result every time. And, in one way or another, each of the approaches in this chapter makes use of the statistical topics we have discussed throughout the book. The one unique wrinkle to simply applying the topics from Chapters 1–7 to time series data is...well...time. Analyzing time series data requires you to
analyze changes in level, trend, and variability—simultaneously—and within and across conditions. This is a bit different from the more "static" datasets that statistics have historically been used to describe.

Structured criteria approaches (in the behavior analytic jargony sense) are perhaps the "simplest" statistical approach to analyzing time series data. The idea here is to create a set of rules for drawing lines on your time series graph wherein you can then analyze whether data from different conditions are similar or different. Statistics are used to create those lines via the arithmetic mean and standard deviations, or via an estimated line from linear regression for the DC and CDC approaches. Though logically straightforward, the statistical assumptions required for these approaches to be technically accurate are likely violated with most time series datasets12. And, in themselves, these approaches do not lend themselves to easy analysis of the multiple control of behavior, systematically changing variability within conditions, nonlinear patterns in our datasets, nor speed to behavior change.

12 Whether that matters is an entirely different conversation. See the footnote above about the spherical cow.

(Non)overlap statistics are one way to eschew all the statistical assumptions of the structured criteria approaches (in the behavior analytic jargony sense). Here, the high-level idea is to simply count how many observations in each condition "overlap" with the data in different conditions. This reads intuitively and is relatively straightforward if your data do not change levels or variability within conditions. However, quantitative descriptions of "overlap" are likely to underestimate differences if the data trend away from the level of other conditions or overestimate differences if the data trend toward the level of other conditions. Relatedly, systematically increasing variability will likely lead to overestimates of differences and systematically decreasing variability will likely lead to underestimates of differences. Lastly, in addition to failing to account for trends in your data, these approaches also do not in themselves allow for the analysis of the multiple control of behavior, nonlinear patterns in our datasets, nor speed to behavior change.

Effect size measures are another approach to quantifying differences in behavior over time. Effect sizes are a welcomed and much preferred alternative to NHST for describing the amount of behavior change we
might expect between conditions or groups. As we saw in Chapter 5, effect sizes come in many shapes and sizes. Some effect size measures for time series data also do not account for trending level or variability and, thus, have the same limitations discussed in the previous paragraph. Effect sizes that do account for trends are obviously preferred where applicable (e.g., Tau-U). However, these may not be able to analyze multiple control, identify nonlinear patterns in our data, or tell us how quickly we can get to behavior change.

The last two approaches discussed in this chapter both surround statistically modeling time series data. At a basic level, the regression and classification modeling approaches introduced in Chapter 6 could also be used with time series data. However, given behavior during one observation is often going to be related in some way to behavior during other observations, nested modeling approaches are likely the best approach to modeling time series data. The benefits of statistically modeling time series data are that these models can account for most of the limitations of the previous approaches. They can handle data with trending levels or variability, they can quantify the size of differences to be expected, they can quantify the multiple control of behavior, and they can handle linear or nonlinear data paths. The downsides to modeling approaches are twofold. First, their flexibility requires a more sophisticated skill set to appropriately conduct the analyses. Most of the earlier approaches require calculating the arithmetic mean or standard deviation, or counting data points that do or do not "overlap," whereas modeling approaches require more steps and more nuanced considerations. Second, modeling approaches can sometimes require more data than a behavior analyst has available to them for the results to be reliable and robust. Nevertheless, in terms of accuracy and flexibility, modeling approaches offer the best bang for the buck.

To close, we'd like to make explicit two points that have been implicit in the chapter thus far. First, at no point did we make a claim or recommendation of which approach is the "best" approach to use. This is because the answer to that question really depends on your data, what question you are trying to answer, the precision of the answer you need, and the audience with which you will communicate the results. Each approach has benefits and limitations and is most suitable to certain contexts. The art of analytics is knowing when to use which tool in your toolbox. BUT, at a minimum, each of the
approaches in this chapter allows you and anyone else who looks at your data to replicate your analytics to arrive at the exact same answer every time. This is a benefit that is head and shoulders above using visual analysis alone.

The second point we want to make explicit is that we focused only on specific cases wherein time series data are being analyzed. Often, these involved behavior analysts comparing behavior across two conditions—baseline and intervention. But, many other disciplines also use time series data as their primary pulse on life. They, also, have had to wrestle with the thorny problems we discussed in this chapter. And, time series data don't always have to involve comparing two conditions in an AB design. In the next chapter, we'll review the basics of quantitatively analyzing time series data that other disciplines commonly employ and their relevance to questions that behavior analysts might want to ask. See you next time. ;-)
References

BACB (n.d.). BACB certificant data. Retrieved from: https://www.bacb.com/bacb-certificant-data/
Bailey, D. B. (1984). Effects of lines of progress and semilogarithmic charts on ratings of charted data. Journal of Applied Behavior Analysis, 17(3), 359–365.
Baldwin, S. A., Stice, E., & Rohde, P. (2008). Statistical analysis of group-administered intervention data: Reanalysis of two randomized trials. Psychotherapy Research, 18(4), 365–376. https://doi.org/10.1080/10503300701796992
Baltagi, B. H. (2008). Econometric analysis of panel data (4th ed.). Wiley. ISBN: 978-0-470-51886-1.
Barrish, H. H., Saunders, M., & Wolf, M. M. (1969). Good behavior game: Effects of individual contingencies for group consequences on disruptive behavior in a classroom. Journal of Applied Behavior Analysis, 2(2), 119–124. https://doi.org/10.1901/jaba.1969.2-119
Bolker, B. M. (2008). Ecological models and data in R. Princeton University Press. ISBN: 0691125228.
Boren, J. J., & Navarro, A. P. (1959). The action of atropine, benactyzine, and scopolamine upon fixed-interval and fixed-ratio behavior. Journal of the Experimental Analysis of Behavior, 2(2), 91–177. https://doi.org/10.1901/jeab.1959.2-107
Brossart, D. F., Laird, V. C., & Armstrong, T. W. (2018). Interpreting Kendall's Tau and Tau-U for single-case experimental designs. Cogent Psychology, 5, 1518687. https://doi.org/10.1080/23311908.2018.1518687
Cohn, E. (1904). Zur elektrodynamik bewegter systeme II. Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften, 43(2), 1404–1416.
Cooper, J. O., Heron, T. E., & Heward, W. L. (2019). Applied behavior analysis (3rd ed.). Pearson.
Costello, M. S., Bagley, R. F., Bustamante, L. F., & Deochand, N. (2022). Quantification of behavioral data with effect sizes and statistical significance. Journal of Applied Behavior Analysis, 55(4), 1068–1082. https://doi.org/10.1002/jaba.938
Cox, D. J., & Brodhead, M. T. (2021). A proof of concept analysis of decision-making with time-series data. The Psychological Record, 71(3), 349–366. https://doi.org/10.1007/s40732-020-00451-w
DeHart, W. B., & Kaplan, B. A. (2019). Applying mixed-effects modeling to single-subject designs: An introduction. Journal of the Experimental Analysis of Behavior, 111(2), 192–206. https://doi.org/10.1002/jeab.507
DeProspero, A., & Cohen, S. (1979). Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 12, 573–579. https://doi.org/10.1901/jaba.1979.12-573
Dowdy, A., Jessel, J., Saini, V., & Peltier, C. (2022). Structured visual analysis of single-case experimental design data: Developments and technological advancements. Journal of Applied Behavior Analysis, 55(2), 451–462. https://doi.org/10.1002/jaba.899
Dowdy, A., Peltier, C., Tincani, M., Schneider, W. J., Hantula, D. A., & Travers, J. C. (2021). Meta-analyses and effect sizes in applied behavior analysis: A review and discussion. Journal of Applied Behavior Analysis, 54(4), 1317–1340. https://doi.org/10.1002/jaba.862
Einstein, A. (1905). Zur elektrodynamik bewegter körper. Annalen der Physik, 322(10), 891. Retrieved from https://www.fourmilab.ch/etexts/einstein/specrel/www/
Fisher, W. W., Kelley, M. E., & Lomas, J. E. (2003). Visual aids and structured criteria for improving visual inspection and interpretation of single-case designs. Journal of Applied Behavior Analysis, 36(3), 387–406. https://doi.org/10.1901/jaba.2003.36-387
Ford, A. L. B., Rudolph, B. N., Pennington, B., & Byiers, B. J. (2020). An exploration of the interrater agreement of visual analysis with and without context. Journal of Applied Behavior Analysis, 53(1), 572–583. https://doi.org/10.1002/jaba.560
Garson, G. D. (2013). Hierarchical linear modeling: Guide and applications. SAGE Publications, Inc. ISBN: 9781412998857.
Hagopian, L. P., Fisher, W. W., Thompson, R. H., Owen-DeSchryver, J. O., Iwata, B. A., & Wacker, D. P. (1997). Toward the development of structured criteria for interpretation of functional analysis data. Journal of Applied Behavior Analysis, 30(2), 313–326. https://doi.org/10.1901/jaba.1997.30-313
Ivey, D. G., & Hume, J. N. P. (1974). Physics: Classical mechanics, and introductory statistical mechanics. University of Michigan Press.
Kahng, S. W., Chung, K. M., Gutshall, K., Pitts, S. C., Kao, J., & Girolami, K. (2010). Consistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 43, 35–45. https://doi.org/10.1901/jaba.2010.43-35
Kazdin, A. E. (1978). Methodological and interpretive problems of single-case experimental designs. Journal of Consulting and Clinical Psychology, 46, 629–642. https://doi.org/10.1037/0022-006X.46.4.629
Kazdin, A. E. (1982). Single-case research designs: Methods for clinical and applied settings. Oxford University Press.
Kenny, D. A., Mannetti, L., Pierro, A., Livi, S., & Kashy, D. A. (2002). The statistical analysis of data from small groups. Journal of Personality and Social Psychology, 83(1), 126–137. https://doi.org/10.1037/0022-3514.83.1.126
Lindstrom, M. J., & Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data. Biometrics, 46(3), 673–687. https://www.jstor.org/stable/2532087
Ma, H. (2006). An alternative method for quantitative synthesis of single-subject research: Percentage of data points exceeding the median. Behavior Modification, 30(5), 598–617. https://doi.org/10.1177/0145445504272974
Matyas, T. A., & Greenwood, K. M. (1990). Visual analysis of single-case time series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23(3), 341–351. https://doi.org/10.1901/jaba.1990.23-341
Merriam-Webster (2021). Statistics. Retrieved from: https://www.merriam-webster.com/dictionary/statistics
Nielsen, N. M., Smink, W. A. C., & Fox, J. P. (2021). Small and negative correlations among clustered observations: Limitations of the linear mixed effects model. Behaviormetrika, 48, 51–77. https://doi.org/10.1007/s41237-020-00130-8
Parker, R. I., Hagan-Burke, S., & Vannest, K. (2007). Percentage of all nonoverlapping data (PAND): An alternative to PND. The Journal of Special Education, 40(4), 194–204. https://doi.org/10.1177/00224669070400040101
Parker, R. I., & Vannest, K. (2009). An improved effect size for single-case research: Nonoverlap of all pairs. Behavior Therapy, 40(4), 357–367. https://doi.org/10.1016/j.beth.2008.10.006
Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances (pp. 157–186). Plenum.
Peltier, W., Newell, K. L., Linton, E., Holmes, S. C., & Donaldson, J. M. (2023). Effects of and preference for student- and teacher-implemented good behavior game in early elementary classes. Journal of Applied Behavior Analysis, 56(1), 216–230. https://doi.org/10.1002/jaba.957
Pierrel, R. (1958). A generalization gradient for auditory intensity in the rat. Journal of the Experimental Analysis of Behavior, 1(4), 303–313. https://doi.org/10.1901/jeab.1958.1-303
Pryseley, A., Tchonlafi, C., Verbeke, G., & Molenberghs, G. (2011). Estimating negative variance components from Gaussian and non-Gaussian data: A mixed models approach. Computational Statistics & Data Analysis, 55(2), 1071–1085. https://doi.org/10.1016/j.csda.2010.09.002
Roane, H. S., Fisher, W. W., Kelley, M. E., Mevers, J. L., & Bouxsein, K. J. (2013). Using modified visual-inspection criteria to interpret functional analysis outcomes. Journal of Applied Behavior Analysis, 46(1), 130–146. https://doi.org/10.1002/jaba.13
Russell, M. (2022). Statistics in natural resources. CRC Press. ISBN: 9781032258782.
Scruggs, T. E., Mastropieri, M. A., & Casto, G. (1987). The quantitative synthesis of single-subject research: Methodology and validation. Remedial and Special Education, 8(2), 24–33. https://doi.org/10.1177/074193258700800206
Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. Appleton-Century-Crofts.
White, O. R. (1974). The "split middle"—a "quickie" method of trend estimation. Seattle: Experimental Education Unit, Child Development and Mental Retardation Center, University of Washington.
Young, M. E. (2017). Discounting: A practical guide to multilevel analysis of indifference data. Journal of the Experimental Analysis of Behavior, 108(1), 97–112. https://doi.org/10.1002/jeab.265
Zhao, T., Luo, X., Chu, H., Le, C. T., Epstein, L. H., & Thomas, J. L. (2016). A two-part mixed effects model for cigarette purchase task data. Journal of the Experimental Analysis of Behavior, 106(3), 242–253. https://doi.org/10.1002/jeab.228
CHAPTER 9

This math and time thing is cool! Time series decomposition and forecasting behavior

On my visit to Chicago, the weather forecast said it was muggy. The forecaster was right. I went outside and someone stole my shoes.
Introduction

This chapter is a bit of the cherry on top of the icing on top of the statistical cake we hope you have been enjoying thus far. The cake in itself is often delicious (except for angel food cake, maybe) and forms the bulk of what comprises the dessert to be savored. The cake was Chapters 1–5. In most situations you're likely to encounter throughout your research and practice, the material in Chapters 1–5 will be all you need to communicate more precisely about what you have observed or will allow you to understand and evaluate the claims someone else makes about the effect that an intervention or research protocol has on behavior. To recap, the cake we've been eating included what exactly a statistic is and why numbers are useful (Chapter 1); data types and data distributions commonly encountered in behavior analysis (Chapter 2); how we can describe the central tendency of our data assuming it is at stability and based on the data type and distribution from which it comes (Chapter 3); how we can describe the variability in our data assuming it is at stability and based on the data type and distribution from which it comes (Chapter 4); and different ways we can infer whether our intervention actually changed behavior (hopefully for the better) and based on the data type and distribution from which it comes (Chapter 5). Yum!

If the first five chapters were the cake, Chapters 6–8 were the icing. Technically, you can probably have your cake (and eat it, too) without any icing. But, how fun is that? Most nonempirical situations involve
a bunch of variables that influence behavior but which we could not/cannot control. Modeling helps us estimate the influence of one (or more) variables on behavior so that we can more precisely talk about the size and direction of that influence (Chapter 6). And, our confidence or certainty in our claims about something influencing behavior increases with more observations. Knowing how many observations we need to feel good about the statistical measures we calculate is extremely valuable information when planning and carrying out data collection or knowing when to stop (Chapter 7). Lastly, behavior analysts have the good(?) fortune of playing primarily with what is arguably the most challenging set of data to statistically work with: time series data. But fear not. Smart people have taken on the unique challenges of time series data and come up with some simple solutions that often work well enough (Chapter 8). How delicious is that icing!!??!?! Your statistical cake is likely to turn out quite nicely for many to enjoy if you have mastered the icing from Chapters 6–8 to put on your cake from Chapters 1–5!

This chapter sits on top of the icing as the maraschino cherry (or the blueberry, or peach, or an accompanying glass of sangiovese—pick your poison). Here, we're starting to indulge ourselves a little. But, oh my, how a little self-indulgence can amplify the experience and create new opportunities. In this chapter, we hope to amplify your experience by talking about time series modeling methods common to other areas of science and industry. Yes, it turns out that behavior analysts are not the only people in the universe who play with time series data of human behavior. In this chapter, we introduce you to these alternate universes to highlight the potential utility of these tools for behavior analysts. And, we'll also highlight how the predictions being made by these folks often make the "predictions" behavior analysts sometimes make look...well...more like post-hoc descriptions.

In the first part of this chapter, we'll start by showing how we can use statistics to break down time series data into its individual components. As discussed in the last chapter, behavior analysts are familiar with the time series data characteristics of trend, level, and variability. We'll talk about methods for taking time series datasets and breaking them into each one of those components quantitatively so that they can be described with precision. In this section, we'll also review other components of time series datasets we have yet to see behavior analysts talk about: seasonality, cyclicity, and stationarity.
In the latter half of the chapter, we'll look at two of the most common methods for predicting future behavior. We're not talking about looking at a complete dataset and claiming, "I would have predicted behavior would have gone here had I not changed the condition." No, here, we're talking about forecasting future behavior. That's right. Just like weather forecasters who use historical data and current conditions to make predictions about weather yet to be experienced, we'll show you how researchers and practitioners in other disciplines put their money where their mouth is to predict human behavior yet to occur. Prediction at its most real and raw. Thrilling, isn't it!

As one final introductory note, the purpose of this chapter is a bit different than with the ones that came before. Like previous chapters, all the information in this chapter can be found in much greater depth in other readily available textbooks and articles. Unlike the chapters before this one, we are not aware of any behavior analysts who have published analyses of their data within behavior analytic journals using these methods, nor have we seen these techniques used in any behavior analytic textbooks. Our rationale for including them here is simply to expand your toolbox. "If the only tool you have is a hammer, it is tempting to treat everything as if it were a nail" (Maslow, 1966, p. 16). "Visual analysis" can certainly be considered the hammer of behavior analysis for analyzing time series data. In the last chapter, we reviewed how behavior analysts will, in seemingly academic contexts only, use a screwdriver or pliers. This chapter provides spanners, sockets, wrenches, and torches. We can't wait to see what art people build as they think about their time series data through these lenses.

If none of the above is interesting to you (or you've simply become hungry with all this talk of cake and want to step away), no worries. As noted earlier, this chapter is more of an extra morsel for the interested. You can skip ahead to the next chapter without harm. But, for those interested in what other disciplines are doing in this space and how shifting our focus might change our view, welcome to the fun. Hopefully you leave this chapter with some novel tools that allow you to measure, describe, and predict behavior-environment relations with different utility.
Time series analyses through a different lens

Behavior analysts are not unique butterflies when it comes to measuring, describing, and predicting behavior based on the environment
surrounding an organism. For example, neuroscientists and physiologists measure how the presentation and removal of stimuli in an organism's environment leads to changes in physiology over time1; ecologists measure the interaction between varying characteristics of a larger environment and a plant or animal's behavior over time; and business analysts the world over measure how economic and market contexts change the behaviors of those who buy their products over time. In these, and many other, situations, people attempt to make sense of human behavior as a result of interactions between characteristics of a larger context over time.

In the last chapter, we reviewed several common approaches to statistically analyzing time series data in behavior analysis. As with much of what we have covered in the book, statistical descriptions and analyses are tools humans use to accomplish some goal based on the context within which they find themselves. The statistical analyses in the previous chapter were focused primarily on situations with three defining characteristics. First, when a behavior analyst has relatively few observations (e.g., fewer than 50 total observations). Second, when the behavior analyst wants to compare responding at stability under two or more well-defined conditions (e.g., baseline and intervention). Third, when the goal is to determine whether behavior differs across the conditions while simultaneously considering trend, level, and variability.

In this section, we briefly review several other ways we can think about quantitatively analyzing time series data when one or more of the above three conditions change. For example, what if we have data spanning years of client progress? And what if, across that time, we created many different programs and the contingencies changed dynamically such that there was never really a single baseline or intervention condition in effect? Lastly, how might we more quantitatively describe and predict how trend, level, and variability will each change in the future? In what follows, we'll show how other scientists and practitioners answer questions like these by first discussing methods for separating and analyzing the various components that comprise time series data (i.e., decomposition models). Once separated into its parts, we'll then review popular methods for predicting future behavior.
1 "Yes, "all's behavior—and the rest is naught."" (Skinner, 1972, p. 348).
Time series decomposition

Any behavior analyst reading this book is likely familiar with three of the fundamental characteristics of all time series data: trend, level, and variability. As noted in popular behavior analytic textbooks (e.g., Cooper et al., 2020), all three of these characteristics are often analyzed simultaneously when a behavior analyst conducts visual analysis of data. Also noted in those same behavior analytic textbooks is that trend, level, and variability can vary on a continuous scale (i.e., there is an infinite range of values each of these characteristics could take). The result is the potential for an exponential combination of time series characteristics that a behavior analyst might contact throughout their research and practice. Though visual analysis of these three characteristics is often described simply, the interaction between these characteristics within and between conditions can make visual analysis a challenging skill to learn. And, this challenge can lead to nontrivial disagreement between any two behavior analysts around whether behavior changed or not (e.g., DeProspero & Cohen, 1979; Ford et al., 2020; Kahng et al., 2010; Matyas & Greenwood, 1990).

Adding to this complexity, researchers and practitioners in other domains often include three additional characteristics of time series data. These are seasonality, cyclicity, and stationarity. Seasonality2 refers to relatively similar patterns of behavior that repeat over a consistent period of time (e.g., energy usage based on time of day, online sales based on time of year). Cyclicity refers to a pattern in the data that does not repeat after similarly consistent periods of time (i.e., it's aperiodic; e.g., housing prices, rates of behavioral acquisition). Stationarity refers to descriptive statistical measures (e.g., arithmetic mean, standard deviation) of time series data remaining consistent over some period of time. Also of note, researchers and practitioners in this literature will often refer to variability as "noise" or "error," which simply refers to the remaining residuals once we have quantitatively accounted for the rest. If you read that last sentence closely, you'll likely guess where we're headed next.
2 Note that the technical definition of seasonality differs from the lay definition. Seasonal patterns do not have to align with the calendar seasons.
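Stationarity is also the one characteristic in that list with a widely used formal test. As a hedged aside with simulated data, the augmented Dickey-Fuller test from statsmodels treats a small p value as evidence against a unit root (i.e., evidence the series is stationary):

```python
# Hedged sketch: probing stationarity with the augmented Dickey-Fuller test.
# Simulated series; adfuller returns (statistic, p value, ...).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
flat = rng.normal(10, 1, 120)            # stable mean and spread
drifting = flat + 0.2 * np.arange(120)   # steadily trending mean

print(adfuller(flat)[1])      # small p: consistent with stationarity
print(adfuller(drifting)[1])  # large p: cannot reject nonstationarity
```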
The main idea

Decomposition models break our data into two or more of the characteristics (trend, level, variability, seasonality, cyclicity, stationarity) of time series data so that we can more easily analyze the individual components. Once we have isolated each of the components and how they are changing over time, we can also figure out how they combine back together to predict behavior in the future. Such predictions about behavior in the future are called forecasting (e.g., weather forecasting). To our knowledge, behavior analysts have never quite published true predictions of future behavior, opting typically to describe behavior-environment relations post hoc. Thus we suspect that much of this will be new information for readers and, we hope, suggests many possible opportunities for future research and experimentation around how these methods might be used fruitfully in behavior analysis.

Throughout this chapter, we are going to use an example dataset obtained via the pandas package in Python (McKinney, 2010) that we have tweaked and relabeled for a behavior analytic context. These time series data are shown in the top panel of Fig. 9.1 along with three additional time series plots that highlight the basics of what decomposition models do3. The top panel shows the average number of programs mastered per week (y-axis) by a hypothetical client over the course of 10 years of receiving services based on the principles of behavior (x-axis). The purpose of a decomposition model is to separate out two or more characteristics of the time series data so they can be analyzed separately from the remaining characteristics. We'll start simply by only separating out trend, seasonality, and variability.

The bottom three panels of Fig. 9.1 are the output of the statsmodels time series analysis seasonal decomposition package for Python (Seabold & Perktold, 2010)4. Essentially, what this package does is break down the data into the components of (1) trend at each time point (Tt), (2) seasonality at each time point (St), and (3) the remaining noise/variability at each time point (et). One way to think about time series data is that the behavior at time t (the output of a model; Bt) is
3 The data for these plots can be found in the accompanying online Excel document available here: https://github.com/david-j-cox/SupplMat-Stats-ABA/. The Jupyter notebook used to generate these plots can be found in the accompanying notebook titled, "time-series-demo.ipynb." There's also an html version under the same name for those who are unable to open Jupyter notebooks.
4 Interested readers can follow along in the accompanying Excel document available here: https://github.com/david-j-cox/SupplMat-Stats-ABA/, too. Though many statistical packages make this much easier.
Figure 9.1 Example time series graph of the number of programs mastered by a client over time (top panel) and decomposed into individual trend, seasonality, and noise components.
determined by adding trend, seasonality, and noise together. As an equation:

Bt = Tt + St + et (9.1)
Stated differently, we can predict the exact level of behavior at any one point in time if we know the exact amount behavior is trending at that point in time and the effect of seasonal patterns (in the technical sense, not in the sense of the Earth’s revolution about the Sun on a tilted axis).
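For readers following along in code rather than Excel, the decomposition behind Fig. 9.1 is a one-liner once the data are in a time-indexed series. The series below is simulated to stand in for the book's dataset (a gentle trend, a 12-month seasonal pattern, and some noise):

```python
# Hedged sketch of the Fig. 9.1 workflow with statsmodels' seasonal_decompose.
# The monthly series is simulated to mimic the book's example.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
months = pd.date_range("2010-01-01", periods=144, freq="MS")
t = np.arange(144)
programs = 4 + 0.08 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, 144)
series = pd.Series(programs, index=months)

parts = seasonal_decompose(series, model="additive", period=12)
print(parts.seasonal.head(12))      # S_t: one full repeating cycle
print(parts.trend.dropna().head())  # T_t; parts.resid holds e_t
```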
The purpose of decomposition is to identify patterns specific to each component. Often, the first decomposition step is to determine and remove the trend from the dataset. Visually, a good decomposed trend is one that looks as close to a straight line as possible. Once trend is determined, anything "remaining" can be further decomposed into additional time series characteristics (e.g., seasonality, noise; Eq. 9.1). Thus we first need to figure out how to extract the trend statistically from the data.

Isolating a trend in the data is often accomplished by first filtering out variability in the data (don't worry, we bring it back in a minute; you don't have to say goodbye). Fig. 9.2 shows what the trend looks like using moving average filters of 3, 6, 9, 12, 15, and 18 months (panel titles). A moving average captures change in data over time by calculating the arithmetic mean of a subset of data points before and after each data point in a time series dataset. The number of values used to calculate the arithmetic mean is determined by the period length we have chosen. For example, a moving window average of 6 months (top middle panel) involves calculating the arithmetic mean of the values spanning 3 months before and 3 months after each month. The resulting set of arithmetic averages using the 6-month-period filter are what are plotted in the panel. Note the trend is not a line resulting from linear regression (see Chapter 8 for how linear regression is used in other statistical approaches to describing time series data). Instead, it is a smoothed average of the data around that point in time.
Figure 9.2 Influence of varying seasonality durations (titles) on the trend of the average number of programs mastered per week for a hypothetical client.
Visual analysis of Fig. 9.2 suggests that the moving average window of 12 months is the ideal period for isolating the trend for these data. The purpose of decomposing (i.e., isolating) trend is to remove seasonality and variability, leaving only trend. The 3-, 6-, and 9-month moving average windows still have quite a bit of variability in them, making them poor candidates as filters for this dataset. At 12 months, however, we get close to a straight line. Beyond 12 months, variability in the moving average data increases, suggesting moving average windows greater than 12 months also do a poor job of capturing trends in the data. Alas, a 12-month-period filter is our golden goose and we know we can use it to isolate the trend for these data. Ergo, we use a 12-month filter to represent the trend in the top panel of Fig. 9.1.

The next step is to decompose (i.e., isolate) seasonality. This is accomplished through two steps (Table 9.1). First, we subtract the trend value from the original data (i.e., subtract the second panel from the first panel in Fig. 9.1; "Difference (raw − trend)" column in Table 9.1). Second, we calculate the average difference between the raw data and our trend after grouping by a time period over which we suspect that seasonality might occur. One way to do this is by creating an autocorrelation5 plot. For example, Fig. 9.3 shows an autocorrelation plot for the data we have been playing with so far. A classic giveaway of seasonality is when a sinusoidal pattern is present. In Fig. 9.3, the peak of that pattern is at 12 months. So, to calculate seasonality, we would calculate the average once for all values from January, once for all the values from February, and so on ("AVG periodic difference" column in Table 9.1). The resulting set of values that capture seasonality can then be graphed for analysis.

5 In behavior analysis, strongly autocorrelated data is sometimes considered a problem to be handled given its impact on single-subject statistical tests of differences between conditions (see Chapter 8). However, as with all tools, autocorrelation is quite handy in the right context, such as identifying seasonality in time series data. This is another example in this textbook wherein we hope you, dear readers, are beginning to see all this statistics and math stuff as tools, plain and simple. They are "good" or "bad" only depending on context and how they are being used.

The seasonality of the data we have been working with are plotted in the third panel of Fig. 9.1. Visually, these data pass the intraocular assault test—the pattern hits you right between the eyes.
Table 9.1 Example showing the top and bottom 12 rows of data from Fig. 9.1 and how they are decomposed using a 12-month moving window. The Excel file accompanying this book contains the full dataset. Blank cells fall outside the centered 12-month window.

Month | Year-month | AVG targets mastered per week | Trend | Difference (raw − trend) | AVG periodic difference | Noise
1  | Jan-2010 | 3.73  |       |       | −0.63 |
2  | Feb-2010 | 3.93  |       |       | −0.96 |
3  | Mar-2010 | 4.40  |       |       | 0.24  |
4  | Apr-2010 | 4.30  |       |       | 0.15  |
5  | May-2010 | 4.03  |       |       | 0.31  |
6  | Jun-2010 | 4.50  |       |       | 1.51  |
7  | Jul-2010 | 4.93  | 4.19  | 0.74  | 3.71  | −2.97
8  | Aug-2010 | 4.93  | 4.23  | 0.70  | 3.66  | −2.95
9  | Sep-2010 | 4.53  | 4.29  | 0.24  | 1.89  | −1.64
10 | Oct-2010 | 3.97  | 4.29  | −0.34 | 0.64  | −0.97
11 | Nov-2010 | 3.47  | 4.28  | −0.81 | −0.57 | −0.25
12 | Dec-2010 | 3.93  | 4.36  | −0.43 | 0.22  | −0.65
:  | :        | :     | :     | :     | :     | :
1  | Jan-2021 | 13.90 | 15.54 | −1.64 | −0.63 | −1.01
2  | Feb-2021 | 13.03 | 15.69 | −2.66 | −0.96 | −1.70
3  | Mar-2021 | 13.97 | 15.56 | −1.59 | 0.24  | −1.83
4  | Apr-2021 | 15.37 | 15.55 | −0.18 | 0.15  | −0.33
5  | May-2021 | 15.73 | 15.51 | 0.22  | 0.31  | −0.09
6  | Jun-2021 | 17.83 | 15.69 | 2.14  | 1.51  | 0.63
7  | Jul-2021 | 20.73 |       |       | 3.71  |
8  | Aug-2021 | 20.20 |       |       | 3.66  |
9  | Sep-2021 | 16.93 |       |       | 1.89  |
10 | Oct-2021 | 15.37 |       |       | 0.64  |
11 | Nov-2021 | 13.00 |       |       | −0.57 |
12 | Dec-2021 | 14.40 |       |       | 0.22  |
Figure 9.3 Autocorrelation plot showing the classic sinusoidal pattern indicative of seasonality in a dataset.
Counting the months beginning with the first marker (i.e., January of 2010), we see that every June through August has the highest rate of mastery per week; every November has the lowest (nearly two fewer targets mastered per week than the rest of the year); December through February is also another set of months with fewer targets mastered than the rest of the year, though not as low as November; and March through May are back around average. As a quick note, these data were fabricated to fit this kind of pattern. Not all instances of decomposition look this clean. However, the pattern is certainly conceivable given the academic calendar in North American schools. And, it also shows how simple statistical decomposition of time series data can make trend and seasonality much easier to see.

Given the simple decomposition model we used in Eq. 9.1, the only remaining term is noise or variability or error or whatever else you want to call the leftovers. The fourth panel in Fig. 9.1 shows the remaining noise (i.e., variability) in the data that is unexplained by trend and seasonality6. To calculate these data, the trend and seasonal values are simply subtracted from the raw data ("Noise" column in Table 9.1).

6 1000 extra credit points to the reader if you can correctly state why the y-axis on this graph is labeled "residual." Hint: think back to Chapter 6.

There are two benefits to isolating the remaining variability in this way. First, we can use very simple descriptive statistics to quantify and precisely describe behavioral variability because we have already accounted for trend and seasonality. In itself, this is useful and avoids much of the hassle in trying to account for variability that we saw in Chapter 8 using more traditional statistical methods. The second benefit is we can better determine if the variables we have included account for our data well. For example, the pattern in the variability in the fourth panel in Fig. 9.1 indicates there are additional influences on the rate that programs are mastered for this client. Specifically, there are three distinct sections of this plot. One from January of 2010 through January of 2015; one from January of 2015 through January of 2018; and one from January of 2018 through the end of 2021. If we were the supervising behavior analysts for these cases, we might dive into case history documents around both time points to figure out what exactly happened. Further, the pattern of increasing variability over the last few years suggests something is influencing the rate of program mastery outside of what we have
236
Statistics for Applied Behavior Analysis Practitioners and Researchers
considered thus far and may have an increasingly large influence. What is that? And, how can knowing what that is help us improve the services we deliver to this client?
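For readers who want to try this on their own data, this kind of decomposition takes only a few lines of Python using statsmodels (Seabold & Perktold, 2010) and pandas (McKinney, 2010). The following is a minimal sketch, not a definitive recipe: the dataset below is fabricated to loosely mimic the pattern described above, and any real analysis would start from your own Series.

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Fabricated monthly data: average targets mastered per week, with a slow
# upward trend, a summer bump, and a November dip (loosely mimicking Fig. 9.1).
dates = pd.date_range("2010-01", "2021-12", freq="MS")
mastered = pd.Series(
    [4 + 0.07 * i + (1.5 if d.month in (6, 7, 8) else -1.5 if d.month == 11 else 0)
     for i, d in enumerate(dates)],
    index=dates,
)

# Additive decomposition (Eq. 9.1): behavior = trend + seasonality + noise.
# period=12 tells the model the seasonal pattern repeats every 12 months.
parts = seasonal_decompose(mastered, model="additive", period=12)

print(parts.trend.dropna().head())   # the 12-month centered moving average
print(parts.seasonal.head(12))       # the repeating monthly seasonal effects
print(parts.resid.dropna().head())   # the leftovers: the "noise" column
```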
Variations on a theme

There were three primary decisions we made above that demonstrate different ways we can decompose time series data into its parts. The first was how we chose to combine those components back together to predict behavior. Specifically, we assumed that the components of the decomposed model combined additively to predict behavior (i.e., are independent of one another; Eq. 9.1). Other common methods for recombining the individual time series components are called multiplicative and mixed. The second decision concerned how we chose to aggregate data to isolate the trend. As you saw above, most decomposition models begin by isolating the trend, so this decision has a direct impact on the remaining characteristics. We chose a simple moving average, but many alternative methods exist. The third decision was which time series characteristics to include in the model. We chose a simple model with three components: trend, seasonality, and noise. But there's no reason we couldn't also isolate level, cyclicity, and stationarity. In the following sections, we briefly highlight variations on each of these decisions to show some of the many creative ways that behavior analysts can think more quantitatively about their time series data.

How do the components combine?

As described in the additive Eq. 9.1, the assumption we made was that trend, seasonality, and noise capture unique and independent characteristics of how behavior has changed over time. Because the model treats trend, seasonality, and remaining variability as independent of each other, changing one is assumed to have no impact on the others. That is, if the trend changed, the seasonal patterns would remain unchanged, as would the noise. Two common alternatives to the additive model of Eq. 9.1 assume the components combine multiplicatively or in a mixed fashion. The multiplicative model assumes that the included components interact with each other to predict behavior. That is, changes in one of these components will be amplified (i.e., multiplied) by any changes or values in the others.
The multiplicative decomposition model with trend, seasonality, and noise can be written as:

$B_t = T_t \times S_t \times e_t$  (9.2)
All the components are identical to Eq. 9.1; we are just multiplying them together rather than adding them. Fig. 9.4 shows what the trend, seasonality, and remaining noise (i.e., residuals, variability) look like when we apply a multiplicative decomposition model (blue markers and data paths) as opposed to an additive model (black markers and data paths) to the same data.
Figure 9.4 Example time series data showing the difference between additive (black markers; identical to Fig. 9.1) and multiplicative decomposition models (blue markers) of the average number of mastered programs per week.
The trend components are identical and overlapping, which makes sense because a 12-month moving average of our data wouldn't change. But the multiplicative model has much smaller variability in seasonality and almost no variability in the remaining noise. Here, you can use any of the metrics from Chapter 6 to compare the two models and quantify how much the multiplicative model improves your descriptions of behavior over time. The final common model type is termed a mixed model. This model often takes the following form (e.g., Nwogu et al., 2019):

$B_t = T_t \times S_t + e_t$  (9.3)
Here, the assumption is that trend and seasonality interact, but the remaining variability unaccounted for by trend and seasonality is independent of these components. For those who are really interested in diving into the weeds on choosing between additive, multiplicative, and mixed decomposition models, we recommend the papers (and resulting citation trail) by Dozie and Ijomah (2020) and by Nwogu et al. (2019).

Variations in decomposing time series data

In the example above, we used a simple moving average to quantify the trend in the data. In so doing, each value used to calculate the average is technically given the same weight (i.e., has the same influence as all other values used to quantify the trend). But, in some instances, we may want to weight recent values more heavily than distal values. The common approach here is to use some variation of exponential smoothing (e.g., simple exponential smoothing, Holt's method, the Holt-Winters method). The mathematical specifics are well beyond the scope of this book. However, of note is that the Holt-Winters method is the most commonly used as it captures level, trend, and seasonality simultaneously (see Gardner, 2006 for a recent treatment of exponential smoothing techniques and ideal use cases). The variations of exponential smoothing described above each use some kind of arithmetic mean of our data. As noted in Chapter 3, however, outliers can bias these estimates in the direction of the outlier. A common technique to exclude data outside of a specific range is referred to as a time series filter. Kind of like a
trimmed mean, the idea here is to filter out the fluctuations in the data that may impact our estimates of trend, seasonality, cyclicity, and the remaining noise. Common filters found in many statistical packages include the Baxter-King bandpass filter (Baxter & King, 1999); the Hodrick-Prescott filter (Hodrick & Prescott, 1997; Ravn & Uhlig, 2002); the Christiano-Fitzgerald asymmetric random walk filter (e.g., Christiano & Fitzgerald, 2003); recursive filtering (e.g., Kitagawa, 1981); and seasonal-trend decomposition using locally weighted polynomial regression (i.e., LOESS; Cleveland et al., 1988, 1990). As with above, the details behind each of these are well outside the scope of this book. The point here is to introduce you to these techniques so that you at least know that tools exist to help you the next time you want to model time series data that contains outliers.

Adding additional components

The model we built above included only trend, seasonality, and noise. But, as we have alluded to in the previous two sections, other models exist that include cyclicity (e.g., Nwogu et al., 2019) as well as level (e.g., the Holt-Winters method). These additional components can be included additively, multiplicatively, or via mixed models. The specific choice of components to include will likely depend on the data you have, the assumptions you are willing to make, and the likelihood that the different components are readily observable in your data. Nevertheless, as the example in Fig. 9.4 shows, it never hurts to try out different models to determine which one performs best with the data you have collected.
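One hedged way to act on that advice in code: decompose the same series both ways, rebuild each model's predictions, and compare them to the raw data with a loss metric from Chapter 6. This sketch reuses the `mastered` Series from the earlier example, and mean absolute error is just one reasonable choice of metric.

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the same series under both assumptions.
additive = seasonal_decompose(mastered, model="additive", period=12)
multiplicative = seasonal_decompose(mastered, model="multiplicative", period=12)

# Rebuild each model's predictions (leaving out the noise term), then score
# them against the raw data with mean absolute error.
pred_add = additive.trend + additive.seasonal
pred_mult = multiplicative.trend * multiplicative.seasonal

mae_add = (mastered - pred_add).abs().mean()
mae_mult = (mastered - pred_mult).abs().mean()
print(f"Additive MAE: {mae_add:.3f}  Multiplicative MAE: {mae_mult:.3f}")
```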
Forecasting behavior

Now that we have our bearings on how to decompose time series data into its parts, we can get to the fun part: predicting future behavior. As mentioned in previous chapters, though behavior analysts often talk about description, prediction, and control, rarely do we see people publish actual predictions about behavior expected to be observed in the future. We might make general claims such as "behavior will increase when we use reinforcement," or "behavior will decrease when we use extinction." But these are simply restatements of physical phenomena and are akin to claiming the "ball will drop to the Earth when I let it go." That's technically true, but anyone can make that kind of prediction. It's a different level of sophistication to predict the exact
angle and velocity needed to launch a satellite into space based on its weight and fuel capacity so that it can escape Earth's atmosphere, avoid careening into the depths of space, and circle the Earth at a distance useful for us to type this book in "the cloud." Fortunately for us, people have been attempting to predict the future for millennia. No wheels need to be reinvented to use these tools, per se. However, we suspect that behavior analysts will likely improve on the general approaches mentioned in this section as they learn how to apply them to their everyday clinical, business, and research projects. As noted in the introduction to this chapter, we are unaware of any behavior analysts doing this work currently. But, in the following, we do our best to highlight relevant examples where these approaches might offer something useful. Kind of like the decomposition models mentioned earlier, behavior analysts regularly play with time series data already. And we suspect many of you readers are much smarter than us and will likely find ways to work these into your milieu of techniques, magic spells, and potions. We have one final note before we get into the goodies of this section. Entire books have been written around time series forecasting (e.g., Hyndman & Athanasopoulos, 2018; Strickland, 2020; Wilson & Keating, 2018). In no universe can we provide a comprehensive review of this information, the many variations, the math behind it all, or tutorials on how to implement it. As a result, what follows is an introduction to the main ideas to whet your appetite and get you dreaming. And so, without further ado, let us begin by predicting your future behavior: you will read the word that follows the end of this sentence.
Exponential smoothing

We already discussed exponential smoothing in the previous section on time series decomposition, so we'll start with the familiar. As a reminder, exponential smoothing is accomplished by simply taking the weighted arithmetic mean of past observations. That's the "smoothing" bit. The "exponential" piece comes from providing less and less weight to observations the further they are from the time period we want to forecast, with the weight of older observations reduced exponentially.

Simple exponential smoothing

The simplest of all exponential smoothing methods is aptly called simple exponential smoothing. This method is most often used when there is no trend or seasonal pattern in your time series data (Hyndman & Athanasopoulos, 2018).
Figure 9.5 Example of forecasting using simple exponential smoothing with stationary data (top panels) and methods that include trend for trending data (bottom panels). The black markers represent the raw data; the blue line represents the estimated and forecasted behavior; the gray dashed line represents a forecast using the arithmetic mean. NB: For the exponential smoothing forecasts involving trend, the smoothing level was set to 0.1 and the smoothing trend was set to 0.2 for these plots.
For example, the data represented by the black markers in the top panels of Fig. 9.5 are stationary (i.e., no trend, similar variability throughout). To make a forecast, we just have to pick how fast we want the weight of older observations to decay in their influence on our prediction. The number we use to specify this is labeled α (spoken as "alpha").7 α can range between 0 and 1. α = 1 puts all the weight on the most recent observation so the forecast simply repeats the last value, whereas α near 0 spreads the weight broadly across older observations, producing a very smooth, slowly updating estimate. In practice, α is often set between 0.1 and 0.3 (NIST, 2003). Each of the plots in the top panels of Fig. 9.5 shows how changing α alters how smooth our estimate of the data is and the resulting predicted forecast of behavior.

Exponential forecasting with trends

You may have noticed that the forecast of simple exponential smoothing is a straight, flat line. Such a forecast is obviously not helpful when our data have a trend or seasonality.

7. For the mathematically adventurous, here's the equation: $\hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha)y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + \cdots + \alpha(1-\alpha)^n y_{T-n}$. If you're following it, you can see how the influence of each time point ($y_{T-n}$) decreases exponentially based on the number of observations it is from the target value we want to predict.
Because data containing trend and seasonality are so common, researchers have figured out ways to incorporate these characteristics into their forecasts. As with decomposition models, the trend and seasonality components can be incorporated in an additive manner or in a multiplicative manner. The lower panels in Fig. 9.5 show the dataset we have been playing with throughout the chapter along with three common exponential forecasting methods for data with trends and seasonality. In each instance, an assumption was made about how the data would continue to trend into the future (e.g., linearly or exponentially). Lastly, a class of methods includes a parameter in the equation that "dampens" the trend to an eventual flat line (e.g., Gardner & McKenzie, 1985). The idea here is that few things in life will continuously increase forever. For example, with behavior, there is a certain point where an organism simply cannot respond any faster, for any longer, with any greater force, or with any shorter latency. Dampening techniques help forecasted levels of behavior level off, which often improves the accuracy of long-term forecasts.
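As a hedged sketch of what these methods look like in code, statsmodels implements both simple exponential smoothing and the Holt-Winters family. This again assumes the `mastered` Series from earlier; the smoothing values are illustrative, not recommendations.

```python
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Simple exponential smoothing: best when there is no trend or seasonality.
# smoothing_level is the alpha discussed above (0.1-0.3 is typical).
ses = SimpleExpSmoothing(mastered).fit(smoothing_level=0.2, optimized=False)
print(ses.forecast(12))  # a flat-line forecast 12 months out

# Holt-Winters: adds a (damped) trend and a seasonal component.
hw = ExponentialSmoothing(
    mastered,
    trend="add",           # additive trend
    damped_trend=True,     # let the trend level off over long horizons
    seasonal="add",        # additive 12-month seasonality
    seasonal_periods=12,
).fit()
print(hw.forecast(12))     # a forecast that trends and wiggles seasonally
```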
Autoregressive Integrated Moving Average models

Exponential smoothing models forecast using statistical descriptions of the trend and seasonality that exist in the data. In contrast, Autoregressive Integrated Moving Average (ARIMA) models make forecasts based on descriptions of autocorrelations present in a time series dataset. As the name suggests, ARIMA models predict behavior using autoregressive models (we'll get to what this means later) and moving averages (discussed previously but used later in a fun new way). Before we get to the fun modeling bit, though, we have to do one quick thing: determine whether our data are stationary and, if they are not, make them so.

Differencing our data

Whether data are stationary or nonstationary is another characteristic of time series data that behavior analysts are likely somewhat familiar with. By definition, a time series dataset is considered stationary if its descriptive statistical properties (e.g., mean, variance, serial correlation; Chapters 3 and 4) remain constant over time. In behavior analytic vernacular, we might call such data "stable." However, we'll
use the term stationary because most behavior analysts do not define precisely (read: quantitatively) what they mean by stable, nor do they typically describe the formal statistical properties of their data when it is "at stability." If your data are stationary, congrats! Jump ahead, pass go, and collect $1 million. If your data have an obvious trend or seasonality to them (e.g., the data in the top panel of Fig. 9.1), no worries. We can make them stationary. Many techniques exist to take nonstationary data and make it stationary through some kind of transformation. One common method for doing this is termed differencing. Essentially, the idea is to analyze not the raw data, but how the data change over some period of time. That is, we calculate the difference between each data point and the data point N periods ago. If you ever took a calculus class, this will be extremely intuitive as we are analyzing rates of change over time. Also, just like in calculus, sometimes first-order differencing does not make our data stationary and we need to difference the differenced data (i.e., second-order differencing), difference the difference of our differenced data (i.e., third-order differencing), or maybe play with things like seasonal differencing. Fig. 9.6 shows what first-order differencing over various time periods does to the data from the top panel of Fig. 9.1. The data in the top two panels show how simple differencing removes the trend quite nicely. Overall, differencing with a period of 1 got us closest to stationary data with the smallest amount of variability over time. However, it is still not perfect because all of the plots show either increasing variability (periods 1, 3, 6, and 9) or lingering seasonal patterns (period 12). Remember that stationary data should have a representative arithmetic mean and constant variability. Thus, one final common transformation forecasting researchers and practitioners can use is the log-transform.8 The bottom two panels of Fig. 9.6 show the same data as the top two panels but following a log-transform. My, oh my! Look how nice those data look after having been differenced with a period of 1 following a log-transform! Take my breath away!
8. Another 1000 extra credit points to the reader who first remembers what behavior analytic data we often take the log-transform of before fitting the model!
Figure 9.6 Time series data with trend and seasonality transformed to stationary for easier analysis using differencing.
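If you want to play with differencing yourself, pandas makes it a one-liner. A minimal sketch, again assuming the `mastered` Series from the earlier examples:

```python
import numpy as np

# First-order differencing: each value minus the value N periods ago.
diff_1 = mastered.diff(1)    # period of 1: month-to-month change
diff_12 = mastered.diff(12)  # seasonal differencing: same month last year

# Log-transform first, then difference, to tame growing variability.
log_diff = np.log(mastered).diff(1)

# Second-order differencing: difference the differenced data.
diff_2 = mastered.diff(1).diff(1)

print(log_diff.dropna().describe())  # roughly constant mean and spread?
```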
Autoregressive models

Now that our data are stationary, we can get into the meat of the ARIMA model. The first bit is to create an autoregressive model. Though this might sound scary, it's actually quite straightforward. In Chapter 6, we discussed regression models and how behavior analysts
can use multiple independent variables to predict behavior. With autoregressive models, instead of using various independent variables to predict future behavior, we use past values of behavior from the same dataset. Auto = self; regressive = regression. So autoregressive just means using the self to predict one's future. To build an autoregressive model, all the behavior analyst has to do is specify how much of the individual's past behavior (B) to include in the model. In the jargon of statistical modeling with time series data, you have to choose the number of lagged values that will be used in the model (p). As an equation:

$B_t = c + \phi_1 B_{t-1} + \phi_2 B_{t-2} + \cdots + \phi_p B_{t-p} + e_t$  (9.4)
So, for example, if we only want to use our measure of behavior from the most recent observation, we would create an autoregressive model of order p = 1. That is, we are simply using a weighted value of behavior from one time step ago (i.e., $B_t = c + \phi_1 B_{t-1} + e_t$). An autoregressive model of order p = 5 would mean that we use weighted values of behavior from the last five time steps to predict future behavior (i.e., $B_t = c + \phi_1 B_{t-1} + \phi_2 B_{t-2} + \phi_3 B_{t-3} + \phi_4 B_{t-4} + \phi_5 B_{t-5} + e_t$); and so on.

Moving average models

The second bit of ARIMA models is the moving average component. If you recall from earlier in the chapter, a moving average simply calculates the arithmetic mean of values around each data point in our time series dataset. With ARIMA models, the values we take the moving arithmetic mean of are the errors made when predicting behavior with a regression model. Stated differently, we can create a moving average of how far off our predictions are using a simple regression model. As an equation (note we use θ here to keep the moving average weights distinct from the autoregressive weights φ):

$B_t = c + e_t + \theta_1 e_{t-1} + \theta_2 e_{t-2} + \cdots + \theta_q e_{t-q}$  (9.5)
Just as with the autoregressive model above, we select the number of lagged errors that will be used in the model (q). ARIMA! You've got the basics, now let's put them together! Let's Integrate them—Auto Regression Integrated with a Moving Average! ARIMA! The only final note is that our data need to be stationary. As described above, this means we are building an ARIMA model to predict our differenced data from above as opposed to the raw data. Here, the differenced data gets the term $B'_t$. And, to integrate the AR
with the MA, we simply put one equation in front of the other equation and add a plus sign to link them: B;t 5 c 1 φ1 B;t21 1 ? 1 φp B;t2p 1 φ1 et21 1 ? 1 φq et2q 1 et :
(9.6)
That's it! The left panel in Fig. 9.7 shows the forecasted behavior using an ARIMA model based on the log-transformed differenced data with period 1 (Fig. 9.6) and using p and q values set to 24. What's readily observable is that the forecasted behavior has all the fun wriggles and jiggles (variability) that the data coming before it had. This differs quite substantially from the smooth, near-linear forecasted values in the previous models. Lastly, the model we built above did not have any seasonality (because we differenced and transformed the dataset so that it didn't). But, given how frequently data with seasonality show up in the wild, and also given that not all data are as easily "cleaned" for analysis, sometimes we end up with data that, despite our best attempts, are not stationary. For these situations, researchers have also come up with ARIMA models that capture seasonality and trend. Appropriately, these are termed Seasonal ARIMA, or SARIMA, models. These models handle predictions by adding additional seasonal terms to the equation (see Hyndman and Athanasopoulos, 2018 for the full mathematical treatment). This makes the model much more complex (see Chapter 6 for discussion around parsimony). But sometimes you gotta do what you gotta do. The right panel in Fig. 9.7 shows a SARIMA model making predictions on the original dataset from Fig. 9.1.
Figure 9.7 Examples of the forecasts (blue lines) using ARIMA (left panel) and SARIMA (right panel).
Note, again, the difference in how this model is making predictions compared to the simple exponential smoothing models in Fig. 9.5.
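For the curious, here is a hedged sketch of fitting and forecasting these models with statsmodels. The orders are illustrative rather than tuned; in practice you would inspect an autocorrelation plot (e.g., statsmodels' plot_acf, as in Fig. 9.3) or compare information criteria before settling on p, d, and q. The `mastered` Series is the same fabricated example from earlier.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# ARIMA(p, d, q): p lagged values of behavior, d rounds of differencing,
# q lagged forecast errors. Setting d=1 has the model difference the
# (log-transformed) data for us.
arima = ARIMA(np.log(mastered), order=(2, 1, 2)).fit()
print(np.exp(arima.forecast(12)))  # undo the log to forecast raw behavior

# SARIMA adds a seasonal (P, D, Q, s) term; s=12 for monthly seasonality.
sarima = ARIMA(mastered, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit()
print(sarima.forecast(12))
```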
Chapter summary

Many researchers and practitioners from a wide variety of disciplines regularly use time series data to make important decisions about their area of practice. Similar to behavior analysts, they are often interested in breaking down time series data into the components of trend, level, and variability. Different from behavior analysts, they often do this directly and quantitatively, and they also account for time series components such as seasonality (regularly repeating patterns in time series data) and cyclicity (irregular or unpredictable cycles within time series data). Methods that separate time series data into its components are referred to as decomposition models or decomposition methods. Such models provide a quantitative description of time series data with greater precision than the textual descriptions behavior analysts often use (e.g., "the data have an upward trend"; "variability decreased between conditions"). Researchers and practitioners from disciplines distinct from behavior analysis have also developed methods for precisely predicting future behavior from the time series data available to them. These are referred to as forecasting methods and are broad and deep enough in their content to typically warrant book-long treatments. We reviewed two of the most common approaches: exponential smoothing and ARIMA/SARIMA models. At the time of writing, we are unaware of any behavior analysts who have published research articles using these methods. However, given their prominence and success in other disciplines that also describe and predict human behavior, we suspect behavior analysts have just not found the right job for these tools. We hope the introductory material offered in this chapter will spark broader interest and enthusiasm for this topic so that people smarter than us can show us what fun things can be done using these approaches.
References

Baxter, M., & King, R. G. (1999). Measuring business cycles: Approximate band-pass filters for economic time series. Review of Economics and Statistics, 81(4), 575–593. https://doi.org/10.3386/w5022

Christiano, L. J., & Fitzgerald, T. J. (2003). The band pass filter. International Economic Review, 44(2), 435–465. https://doi.org/10.1111/1468-2354.t01-1-00076

Cooper, J. O., Heron, T. E., & Heward, W. L. (2020). Applied behavior analysis (3rd ed.). Merrill-Prentice Hall.

DeProspero, A., & Cohen, S. (1979). Inconsistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 12(4), 573–579. https://doi.org/10.1901/jaba.1979.12-573

Dozie, K. C. N., & Ijomah, M. A. (2020). A comparative study on additive and mixed models in descriptive time series. American Journal of Mathematical and Computer Modeling, 5(1), 12–17. https://doi.org/10.11648/j.ajmcm.20200501.12

Ford, A. L. B., Rudolph, B. N., Pennington, B., & Byiers, B. J. (2020). An exploration of the integrated agreement of visual analysis with and without context. Journal of Applied Behavior Analysis, 53(1), 572–583. https://doi.org/10.1002/jaba.560

Gardner, E. S. (2006). Exponential smoothing: The state of the art, Part II. International Journal of Forecasting, 22(4), 637–666. https://doi.org/10.1016/j.ijforecast.2006.03.005

Gardner, E. S., & McKenzie, E. (1985). Forecasting trends in time series. Management Science, 31(10), 1237–1246. https://doi.org/10.1287/mnsc.31.10.1237

Hodrick, R. J., & Prescott, E. C. (1997). Postwar U.S. business cycles: An empirical investigation. Journal of Money, Credit and Banking, 29(1), 1–16. http://www.jstor.org/stable/2953682

Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and practice (2nd ed.). OTexts. ISBN: 978-0987507112.

Kahng, S. W., Ching, K. M., Gutshall, K., Pitts, S. C., Kao, J., & Girolami, K. (2013). Consistent visual analyses of intrasubject data. Journal of Applied Behavior Analysis, 43(1), 35–45. https://doi.org/10.1901/jaba.2010.43-35

Kitagawa, G. (1981). A nonstationary time series model and its fitting by a recursive filter. Journal of Time Series Analysis, 2(2), 103–116. https://doi.org/10.1111/j.1467-9892.1981.tb00316.x

Maslow, A. H. (1966). The psychology of science: A reconnaissance. Harper & Row. ISBN: 978-0-8092-6130-7.

Matyas, T. A., & Greenwood, K. M. (1990). Visual analysis of single-case time-series: Effects of variability, serial dependence, and magnitude of intervention effects. Journal of Applied Behavior Analysis, 23(3), 341–351. https://doi.org/10.1901/jaba.1990.23-341

McKinney, W. (2010). Data structures for statistical computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

NIST. (2003). Exponential smoothing. Retrieved from: https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/exposmoo.htm

Nwogu, E. C., Iwueze, I. S., Dozie, K. C. N., & Mbachu, H. I. (2019). Choice between mixed and multiplicative models in time series decomposition. International Journal of Statistics and Applications, 9(5), 153–159. https://doi.org/10.5923/j.statistics.20190905.04

Ravn, M. O., & Uhlig, H. (2002). On adjusting the Hodrick-Prescott filter for the frequency of observations. The Review of Economics and Statistics, 84(2), 371–380.

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference.

Skinner, B. F. (1972). A lecture, "On having a poem." In B. F. Skinner (Ed.), Cumulative record (3rd ed., pp. 345–355). Appleton-Century-Crofts.

Strickland, J. (2020). Time series analysis and forecasting using Python and R. Lulu.com. ISBN: 978-1716451133.

Wilson, J. H., & Keating, B. (2018). Forecasting and predictive analytics with ForecastX. McGraw Hill. ISBN: 978-1260085235.
CHAPTER 10

I suppose I should tell someone about the fun I've had: Chapter checklists for thinking, writing, and presenting statistics

Your value will not be what you know; it will be what you share.
Ginni Rometty
Introduction

Rarely do behavior analysts develop, implement, and analyze their work in a vacuum, completely separated from other humans. Before a research study or an intervention is conducted, they must obtain approval of their proposed work from an Institutional Review Board or the client (informed consent or assent). And, at some point after the onset of research or an intervention, behavior analysts must disseminate their findings if they hope to do more of similar work in the future. For the researcher, dissemination may come in the form of presenting their research at lab meetings and scientific conferences, or writing manuscripts for peer-reviewed journals. For the practitioner, dissemination may come in the form of vocal-verbally, graphically, or textually communicating the outcomes of the implemented intervention to the client and whoever is paying for the intervention. Dissemination can be conceptualized as the interaction between response forms (e.g., presentations, publications) controlled by the multiple variables that characterize the audience with whom one is communicating (e.g., colleagues, payors, client). Different audiences likely will expect different forms of statistical presentation (including you as an audience for your own analytic behavior!). For example, other behavior analysts will likely prefer time series plots of descriptive statistics; researchers familiar with group designs will likely prefer inferential statistics and effect sizes; health insurance plans will likely
prefer descriptive statistics, inferential statistics, and effect sizes; and all might be interested in measures of social significance capturing client or participant satisfaction around progress made relative to what would have happened without intervention. The purpose of this chapter is to provide a compact review of everything discussed in this book to help you think, write, and present the descriptive and inferential statistics that are part of your research or intervention. We obviously can't know every person's preferences for every audience everywhere and at all points in time. So this chapter will necessarily be incomplete in some regard. However, given that the audience of this book is likely behavior analysts newer to explicitly thinking about statistics when they write or present, the form this chapter takes centers around what (we hope) is a practical function. Specifically, we summarized everything covered in the book in two ways. First, the content of the book is reviewed in the form of a list of questions you can ask yourself and reminders around recommended dos and don'ts. Second, we sequenced the content in the prototypical order in which data are likely aggregated, analyzed, interpreted, and presented to your audience of interest. This chapter is also not many things. First, this chapter isn't a guide on being a better writer or presenter. Many texts already exist for this purpose (our recommendations include Feldman & Silvia, 2010; Friman, 2014; Heinicke et al., 2022; Schimel, 2012; Silvia, 2014, 2017; Strunk & White, 1999; Valentino, 2021). Second, this chapter assumes you have already created a practical and accurate operational definition of the behavior you are interested in analyzing. Lastly, this chapter assumes that you have already identified the most appropriate and practical method of data collection for the behavior you are interested in analyzing. Statistics begin once you have collected data and are ready to aggregate, analyze, interpret, and present your results.1 We hope this chapter provides a succinct guide on being a better statistical writer and presenter so that your audience is not left wanting.
1. We recognize that this contrasts with the argument of many statisticians, who would prefer that the statistical tests you'll need to run guide your data collection process and experimental design. Our preference is that researchers first figure out the best way to answer a question they have, let that guide data collection and experimental design, and then let statistics deal with the aftermath.
Checklist and questions to answer when writing/presenting about statistics

Data provenance

1. You lose information when you convert your observations into numbers. What information are you losing with your method of data collection? Is this information important? Have you communicated to the audience the important information lost and why it is unlikely to impact your analysis?
2. What is the chain of behaviors that translates your observation of behavior and environmental occurrences into an electronic format for analysis? Are there any points in this chain whereby errors or inaccuracies might occur? If so, how are you mitigating those errors or inaccuracies?
3. If your data are stored electronically, do you have appropriate backups of your data in case something happens to the dataset you primarily work from? How often do you make backup copies?
4. Are you documenting the steps you take to move and aggregate your data such that someone else could replicate your results exactly with the same raw dataset?
5. Are there any missing data in your dataset? Are there any outliers or oddities that need to be handled? How are you handling missing or atypical data? How does your choice of method for handling missing or atypical data impact the results of your analyses? Have you made these decisions explicit for your audience?
6. Are you combining more than one data type? If so, are you doing this consistently (e.g., using similar denominators) and making this explicit for your audience?
7. For modeling: Do you need to conduct feature scaling? If so, which method did you use, and have you made this explicit for your audience?
Descriptive statistics

8. Remember that readers with some kind of statistical learning history will have expectations around what kinds of things are considered and included in the write-up. Who is the audience? How can you talk about your data to best help them learn what you have learned about behavior-environment relations or the client with whom you work?
9. What is the type of data you have collected? (Table 2.1)
10. What is the distribution of the data you have collected? (Table 2.3; Figure 2.3)
11. At what level are you choosing to aggregate your data? Is this the best level for answering your question? Would a different level provide different insight? (Figure 3.1)
12. What is the most appropriate measure of central tendency for your data based on its type and distribution? (Figure 3.3)
13. When describing spread, are you trying to communicate how well you know the measure of central tendency or the likely variability someone might expect? (Tables 4.2 and 4.3)
14. What are the most appropriate measure(s) of spread for your data based on its type, distribution, and the function of your communicating spread? (Tables 4.2 and 4.3)
15. Many measures of data spread involve a data transformation of some kind. Does yours? If so, what are the transformations that occur? How does this impact how you interpret your measure of spread? Have you communicated this impact to your audience?
16. Did you choose the descriptive statistics based on historical convention? Logically, based on the data type, distribution, and questions being asked? Both? Neither? Unknown?
Inferential statistics

17. Remember that readers with some kind of statistical learning history will have expectations around what kinds of things are considered and included in the write-up. Who is the audience? How can you talk about your data to best help them learn what you have learned about behavior-environment relations or the client with whom you work?
18. Will your audience expect/prefer tests of statistical significance? If so:
   a. What are the models/hypotheses being tested? Have you made these explicit? Are you presenting them textually, graphically, or mathematically? Can you present them in more than one way to better accommodate more than one audience?
   b. Which statistical test best aligns with the data you have collected and comparisons you want to make? (Figure 5.1)
   c. How did you choose your p value threshold? Have you made this explicit for your audience?
   d. Remember that p values tell you the probability of observing data at least as extreme as yours if the null hypothesis were true, not the probability that the null hypothesis is true given your data. Interpret all results accordingly and report an exact p value instead of a threshold.
19. Will your audience expect/prefer measures of effect size(s)? If so:
   a. Which effect size measure makes the most sense based on your data and what you are trying to communicate?
   b. Have you made explicit whether you're reporting on measures of difference or measures of association/correlation?
   c. Have you interpreted your effect size in a manner relatable to the behavior you are studying (e.g., what does a Cohen's d of 0.42 mean for the behavior-environment relations you studied)?
20. Will your audience expect/prefer measures of social significance? If so:
   a. Are they most interested in goals, procedures, effects, or something else?
   b. Have you measured social significance in a manner that is unbiased? Have you communicated to your audience how you reduced response bias?
Modeling

21. The world is a complex web of interacting pieces. What important variables might influence behavior that you were unable to physically control as part of your intervention or experiment? How are you measuring those variables?
22. What is the function of your model: description or prediction? Have you made this explicit for your audience?
23. What kind of model is most appropriate for the question you want to ask and your data (Table 6.2)? Is your model focused on regression or classification? How many independent variables (IVs) are being used to predict a dependent variable?
24. What fit/loss metric did you choose? Why did you choose it? (Tables 6.3 and 6.5)
25. Have you interpreted your fit/loss metric in a manner relatable to the behavior you are studying (e.g., what does an r² of 0.86 or a Mean Absolute Error of 1.3 mean for the behavior-environment relations you studied)? Have you convinced your audience that the fit/loss metrics suggest your model is "good enough" such that the parameters are interpretable?
26. What type of model did you create and what is its structure? How does this information impact how you interpret your free parameters? Have you interpreted your free parameters in a manner relatable to the behavior you are studying (e.g., what does a β = 5.4 mean for the behavior-environment relations you studied)?
Sample size

27. The underlying question with all sample size estimates and power analyses is whether we have collected enough data to have a valid or reliable representation of the effect of IV(s) on behavior. How will your audience know that you have collected "enough" data? What was/is your strategy for making this claim? Have you made your definition of "stability" explicit and quantitative?
28. Sample size is a hierarchical concept.
   a. How do you know you collected data on "enough" responses to aggregate them into a single data point? Is the number of responses that aggregate into a single data point consistent across your graphs (i.e., does each data point "mean" the same thing)?
   b. How do you know you collected data on "enough" sessions/observations to aggregate into a single condition? Is the number of sessions/observations within a condition consistent across your graphs?
   c. How do you know you collected data from "enough" people to aggregate into a generalized claim about the effectiveness of your intervention or IV manipulation?
   d. Note: consistency in #28a and #28b can be either raw number or relative to the stability criterion you set in #27.
29. When answering the above, what proof will your audience likely require? Do you need to conduct a power analysis or will graphical analysis be sufficient? When are you expected to have done this: before, during, or after the study? And when is it most useful for you in planning your work to have conducted these?
Time series considerations

30. Time is an IV that adds wrinkles to quantitatively analyzing data. How are you ensuring that anyone, anywhere, with the exact same dataset will arrive at the exact same answer as you when interpreting the effect of your intervention?
31. Based on the data analytic strategy you used, what are the potential challenges that time adds to interpreting those analyses? How have you addressed those limitations? Have you communicated this to your audience such that they can trust the results you are claiming?
32. The number of professions that use time series data is legion. Are there techniques or methods from other disciplines that would be useful to you with your question? Is the audience expecting you to have used one of those methods because it is common in the circles they run within?
General considerations for efficiency

33. If writing/presenting for an academic audience, do you have a "Data and Analysis" section at the end of your methods where you can efficiently provide the level of information your audience will want based on the above 32 questions?
34. Have you made explicit the assumptions you made when analyzing your data and how they might impact the statistics you used and your interpretation of your results?
References

Feldman, D. B., & Silvia, P. J. (2010). Public speaking for psychologists: A lighthearted guide to research presentations, job talks, and other opportunities to embarrass yourself. American Psychological Association.

Friman, P. C. (2014). Behavior analysts to the front! A 15-step tutorial on public speaking. Behavior Analyst, 37(2), 109–118. https://doi.org/10.1007/s40614-014-0009-y

Heinicke, M. R., Juanico, J. F., Valentino, A. L., & Sellers, T. P. (2022). Improving behavior analysts' public speaking: Recommendations from expert interviews. Behavior Analysis in Practice, 15(1), 203–218. https://doi.org/10.1007/s40617-020-00538-4

Schimel, J. (2012). Writing science: How to write papers that get cited and proposals that get funded. Oxford.

Silvia, P. J. (2014). Write it up: Practical strategies for writing and publishing journal articles. APA LifeTools.

Silvia, P. J. (2017). How to write a lot: A practical guide to productive academic writing (2nd ed.). APA LifeTools.

Strunk, W., & White, E. B. (1999). The elements of style (4th ed.). Longman Publishers.

Valentino, A. L. (2021). Applied behavior analysis research made easy: A handbook for practitioners conducting research post-certification. Context Press.
CHAPTER 11

Through the looking glass: Probability theory, frequentist statistics, and Bayesian statistics

There are no facts, only interpretations.
Friedrich Nietzsche
Introduction

One chapter left! You've almost made it! Hopefully, you are well hydrated and can avoid cramping through the final stretch. To briefly review the race you've run, we began the book by defining statistics as the collection, analysis, interpretation, and presentation of quantitative data (Merriam-Webster, 2021). Though often discussed in behavior analysis relative to group design research, behavior analysts use statistics daily when they aggregate data at the session level, aggregate responding within and across experimental conditions to make a claim about an intervention effect, and aggregate responding across participants through meta-analyses and claims about "laws and principles of behavior." As a branch of mathematics, statistics rely heavily on logic which, for our purposes, just means there are specific rules1 for how we can use numbers to describe behavior-environment relations. So what are these rules humans have to follow to use statistics "accurately"? Well, data can be collected around many different things which, when aggregated and graphed, can take all sorts of visual shapes. These are referred to as data types and data distributions (Chapter 2). The type and distribution of our data then influence the most accurate way we can describe the amount of behavior the reader might expect (Chapter 3) and how much behavioral variability occurs from observation to observation (Chapter 4).

1. For example, the utility of numbers breaks down as soon as one person can claim 1 + 1 = 5 and another that 1 + 1 = −3. But, by agreeing to some basic rules for how we'll use numbers, this wonderful human invention of mathematics allows us to do all sorts of mind-boggling things.
Behavior analysts rarely want to describe their data from only one session or one condition. Rather, we often want to compare data across sessions, across conditions, or across participants to get a sense of precisely what effect our intervention had on behavior (e.g., statistical significance, effect size, or social significance; Chapter 5). And sometimes we want to create equations to learn how the levels of a single independent variable (IV) or of many IVs might precisely control behavior (Chapter 6). Everything in the preceding paragraph requires the analysis of data. And, because we don't live in a vacuum, any system of data collection and analysis has constraints. One constraint is resources (e.g., time, people, money). Behavior analysts often need to know, practically, when they can stop collecting data because what they have collected accurately describes the behavior-environment relations of interest (Chapter 7). A second constraint surrounds time series data. That whole "time as an IV" thing adds some unique wrinkles to statistical analyses of behavior-environment relations that need to be handled (Chapter 8). Fortunately, behavior analysts are not the only ones who use time series data. Researchers and practitioners in other fields have learned how to statistically account for trend, level, and variability in practically useful ways (Chapter 9). Fruitful future research likely involves experimenting with these tools to see what has utility in behavior analysis and under what conditions. Finally, after all the above, we often want to tell others what we have observed and what it means for them and for the individuals with whom they work. To be most helpful to the greatest number of people, it typically is nice to follow some basic rules and provide some basic information when writing up your statistical results (Chapter 10). At this point, you are essentially done. No more is technically needed from you, and you can likely rest easy knowing that you have run the statistical gauntlet and come out a victor. In this bonus chapter, however, we get to see what happens when we go through the statistical looking glass. For those unfamiliar with Carroll's (1909) novel, Through the Looking-Glass is the sequel to Alice's Adventures in Wonderland (Carroll, 1893). In Through the Looking-Glass, Alice (the protagonist) enters a world unique from our everyday experiences by climbing through a mirror (i.e., a looking glass) and, just like a reflection in a mirror, everything in this other world is reversed—including logic and mathematics.
This final chapter is kind of like stepping through a statistical looking glass. At the end of the day, statistics, mathematics, and logic are all instances of verbal behavior (Marr, 2015). Though still up for debate among philosophers, it's unlikely that numbers, math, and logic have a physical reality beyond their arbitrary stimulus properties and socially determined function. And, because this is all verbal behavior constrained by rules and social convention, other systems of verbal behavior with different logical rules around math and numbers might be equally or more effective for describing the environment-behavior relations we are interested in and for predicting behavior. This chapter explores different ways of thinking about what it means when we aggregate data from many individual observations. Everything up to this point in the book has been from the lens of what is called frequentist statistics. But other approaches exist and, in this chapter, we step through one of those looking glasses: Bayesian statistics. As with everything else in this book, each approach has benefits and drawbacks, and there are likely conditions under which each approach is most useful based on why you are collecting, aggregating, analyzing, interpreting, and communicating about data. Here, we simply want to describe the logic behind these approaches so that the curious reader has a better sense of what they're in for when they choose to enter the lands of these statistical Jabberwockies. However, before we can get to alternative approaches, we first need to review probability theory and then the assumptions frequentist statistics makes about what "probability" is shorthand for. We suspect the reason we chose the frequentist approach throughout the book will become clear as you read the chapter. But, if you're still unsure or curious, find us at the next behavior analytic conference, buy us a beer, and we'll get as deep into this as you'd like.
It's assumptions, all the way down

Throughout the book we have talked about the idea of science involving behaviors around models and model building. Rarely do we simply collect and aggregate data for the sake of collecting and aggregating data. Instead, we typically collect and aggregate data to do something with it. For example, to facilitate our evaluation of an intervention on behavior, we typically graph the response rate in each session (a statistical description) and clearly denote the order in which the sessions
were conducted (i.e., a time series plot) and use visual cues to denote which sessions were baseline and which were intervention (a conditional marker). Another way to frame the conversation is that data, by themselves, don't tell us anything. Imagine if you were shown the top panel in Fig. 11.1. With no x-axis, no y-axis, no condition labels, and nothing but the void of the universe surrounding it, it would be hard to argue that this random squiggle is useful in any practical way. But, as soon as we add a y-axis with ticks and a label (second panel), we start to get a sense of what the squiggle is meant to represent. Adding the x-axis with ticks and a label (third panel) gives us a temporal dimension. And, for most sciences anyway, adding this temporal dimension allows us to start talking about probable causes for changes in the location of squiggles on our graph. Adding the condition label (fourth panel) makes it even more clear what might be different across data and trends over time. But, to the groans of many, we'll trot out the worn but still important truism that correlation does not equal causation. Just because two things change together in time does not mean they are causally related (see Vigen, 2015 for fun examples of this). The final pin that makes the entire graph interpretable and useful to the viewer is something that's not visible at all: the assumptions you make when the visual stimuli tacted as "a plot depicting a reversal design" traverse through the air and dance across the back of your retina. For most behavior analysts, the assumptions that make the graph in Fig. 11.1 sensible are those resulting from an operant-respondent interpretation of behavior. Specifically, our conditioned assumption is that behavior change occurs through repeated exposure to contingencies over time. With that assumption, the potential causal relation between environment and behavior displayed in the figure becomes useful. Without that underlying assumption, it would be hard to know what to do with the graph by itself. You had to learn what the graph "means" and how to practically behave following contact with that stimulus array.
Figure 11.1 Demonstration of the assumptions and context required to analyze and interpret data collected within a reversal design.
Conditioned assumptions influence how we behave relative to statistics, too. And, if you don't buy those assumptions, it becomes tricky to practically use the results you are looking at. The primary difference among the approaches described in the following sections is the assumptions they make about how to use logic and numbers to claim we "know" something based on the data we have collected. In describing them, we hope to highlight some of your own likely assumptions about what exactly your data tell you and, perhaps most importantly, that different behavior analysts may see the world differently than you. But, before we can get to a coherent overview of assumptions, we need to first review a fun set of verbal behavior referred to as probability theory. We promise it's less scary than it sounds—plus, you're likely already thinking this way but didn't know people used special terms to talk about it.
Probability theory

What is it?

Formally, probability theory is an area within the broader field of mathematics aimed at precisely describing the relationships between events that do not hold perfectly (Rudas, 2010). For example, consider four simple schedules of reinforcement: fixed ratio (FR), fixed interval (FI), variable ratio (VR), and variable interval (VI). With an FR 1 schedule, every single time that the rat presses the lever, the pigeon pecks the key, or the client mands, a reinforcer is delivered (assuming the operant chamber didn't malfunction and the therapist observed the mand). The relation between behavior and reinforcer delivery holds perfectly. However, for FR schedules greater than 1 and for FI, VR, and VI schedules, the relation between behavior and reinforcer delivery does not hold perfectly. If your grandma were to ask how long it will be before the poor rat in the cage gets more food for all the work they're doing, the most succinct and accurate answer would be, "It depends." As you go about your daily practice, if you find yourself saying "it depends," then you're likely in the land of probability. Using probability theory to describe what we observe has many advantages. You likely are already familiar with the benefits of using numbers to describe probabilities by way of how we use numbers to describe schedules of reinforcement that are not continuous (i.e., are
not FR 1). We may not know whether the very next response on a VR 20 or VI 60-second schedule will contact reinforcement. But our standard numerical descriptions for schedules of reinforcement allow the reader to quickly know that reinforcement will follow, on average,2 every 20th response (VR) or the first response after, on average, 60 seconds have passed (VI). This nomenclature for describing schedules of reinforcement is just one example of how behavior analysts already use probability theory to succinctly describe behavior-environment relations. Another common example of how behavior analysts use probability theory is when they conduct descriptive analyses to supplement their analysis of environmental variables related to disruptive behavior (Oliver et al., 2015).3 To illustrate, Antecedent-Behavior-Consequence (ABC) data collection is one descriptive method for structuring what people pay attention to and what they collect data on. Typically, this involves creating a data collection form where the data collector can take notes or record data on what happened immediately before a behavior (Antecedents), what exactly the behavior looked like (Behavior), and what happened after the behavior (Consequences). After collecting ABC data, behavior analysts aggregate their observations (statistics) to identify two important probabilities: first, how often behavior follows specific antecedents, as they may not perfectly precede every instance of the behavior; and second, how often behavior leads to specific consequences, as they may not perfectly follow every instance of the behavior. For example, maybe we find that 80% of the time the teacher presents work right before the student starts yelling and throwing chairs; and 70% of the time when the student yells and throws chairs they never end up doing that work. This is probability theory at work! You're already doing it!

2. Or should we be saying, "on arithmetic mean" (Chapter 3)? Think this will catch on?
3. Readers shouldn't interpret our discussion as an endorsement of using only descriptive analyses to identify functional relations (see Hall, 2005; Lerman & Iwata, 1993; Pence et al., 2009; Thompson & Iwata, 2007). Rather, descriptive analyses seem most useful to inform and refine hypotheses to test during a functional analysis.
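Because these ABC summaries are just counting and dividing, they are easy to script. Here is a minimal sketch in Python, with an entirely hypothetical set of coded observations standing in for real ABC records:

```python
# Each hypothetical record: (antecedent, behavior occurred?, consequence).
abc_records = [
    ("work demand", True, "escape"),
    ("work demand", True, "escape"),
    ("work demand", True, "reprimand"),
    ("peer attention", True, "escape"),
    ("work demand", False, "n/a"),
]

# Keep only the observations where the target behavior occurred.
behavior = [r for r in abc_records if r[1]]

# How often did the behavior follow a work demand?
p_antecedent = sum(r[0] == "work demand" for r in behavior) / len(behavior)

# How often did the behavior lead to escape?
p_consequence = sum(r[2] == "escape" for r in behavior) / len(behavior)

# Joint probability of both events, assuming they are independent.
print(p_antecedent, p_consequence, p_antecedent * p_consequence)
```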
Why is this useful?

Understanding that much of what behavior analysts already do involves probability theory is useful for several reasons. First, there are rules to how we can combine and calculate multiple probabilities. As a demonstration of why these rules matter, consider the basic rules of addition. If you go to the grocery store, the sign says that an apple (Jason suggests Fuji apples) costs $0.50, and if you grab two apples, you expect to pay $1.00 to take those delicious apples out of the store. But what if the cashier inconsistently uses the rules of addition? What if they charge you $8.50 for two apples and the person behind you $0.10 for two apples because the rules for how to add two numbers together could change arbitrarily? That would make communication and collaboration between people quite tricky. The usefulness of following mathematical rules also holds for combining probabilities.

Using the ABC data collection example above, what is the probability that yelling and throwing chairs follow work demands and that these behaviors lead to escape from that work? Without the rules of probability, it would be difficult to consistently answer the question: How often do work demands lead to yelling and throwing chairs which allows the student to escape work? The simplest rule for combining two probabilistic events is to multiply them together (i.e., calculate the joint probability). Strictly speaking, this simple multiplication assumes the two events are independent of one another (see footnote [4] below). Using the ABC data collection example, there was an 80% chance that yelling and throwing chairs followed work demands (0.80), and a 70% chance that yelling and throwing chairs led to escape (0.70). Multiplying these together we get 0.80 × 0.70 = 0.56; a 56% chance that both events occur. Without these rules, it would be hard to answer consistently and precisely a question asked by a new staff member about the history of escape as a reinforcer when presented with work demands.

A second reason it's important to recognize that behavior analysts already use probability theory is that probability arguably underlies the very nature of our functional and contextual analytic approach. You may have noticed in the previous paragraph that the joint probability calculation missed a key component. We may not only be interested in the probability that a behavior followed an antecedent independent of the probability that the consequence followed the disruptive behavior. Restated using behavior analytic jargon, we often are interested in the three-term contingency: the probability that the antecedent-behavior-consequence sequence occurred together. Probability theory gives us a very precise way to talk about these things using what's called conditional probability. Conditional probabilities answer the question: What's the probability that event B happened given that event A already happened?
As you can imagine, calculating probabilities based on probabilities is a bit more complex when those events depend on one another, such as when we analyze operant and respondent behavior. Typically, the easiest way to do these calculations is to create what's called a conditional probability table. We've done that for you in Table 11.1 using the ABC data collection example from above. The first thing you may recognize are those probabilities we described earlier (i.e., the top row shows the probability that work demands were followed by yelling and throwing chairs; the first column shows the probability that escape from work followed yelling and throwing chairs). You may also notice that focusing on just these two probabilities fails to tell the entire story, as nowhere in either of those calculations are the two instances where we observed yelling and throwing chairs when some other antecedent occurred and the four instances where something other than escape followed these behaviors. Table 11.2 shows the equation we need to precisely calculate the probability that yelling and throwing chairs followed demands to engage in work and led to escape from those demands. Using behavior analytic jargon, Table 11.2 contains the equations to calculate the probability that yelling and throwing chairs are maintained by escape from demands. To do this we need a bit more information about the three-term contingency we're analyzing. Out of all the instances where yelling and throwing chairs occurred, we need to know the probability that those behaviors followed a demand which was followed by escape (Table 11.1; 12/20 = 0.60). We also need to know how often yelling and throwing chairs happened following demands (16/20 = 0.80). Once these two items are known, we can plug them into the conditional probability equation, and—Voila!—we can now state with precision there is a 75% chance that an escape contingency surrounds yelling and throwing chairs[4].
Table 11.1 Conditional probability table recording the frequency of antecedent and consequence events surrounding yelling and throwing chairs for hypothetical ABC data.

| Antecedent ↓ / Consequence → | Escape from work   | All other consequences | P(antecedent)      |
|------------------------------|--------------------|------------------------|--------------------|
| Demand to engage in work     | 12                 | 4                      | (12 + 4)/20 = 0.80 |
| All other antecedents        | 2                  | 2                      | (2 + 2)/20 = 0.20  |
| P(consequence)               | (12 + 2)/20 = 0.70 | (4 + 2)/20 = 0.30      | 1.00               |

Note: The denominator was calculated by adding all events to obtain a total number of events (12 + 4 + 2 + 2 = 20).
Table 11.2 Demonstration of the different ways we can verbally describe operant and respondent analyses using the ABC data collection information from the text.

Joint probability
- Nonstatistical description: The student will yell and throw chairs most of the time they're asked to do work. Sometimes that lets them get out of work. About half the time these both happen.
- Statistical description: The student yells and throws chairs 80% of the time demands are presented. Escape follows this behavior 70% of the time. These both occur 56% of the time.
- Mathematical description: P(A, B) = P(A) × P(B). (Spoken: The probability that A and B both happen is equal to the probability of A multiplied by the probability of B.)
- Using the numbers from the in-text example: 0.56 = (16/20) × (14/20) = 0.8 × 0.7

Conditional probability
- Nonstatistical description: The student will get out of work after they've been presented with demands more often than not. But, sometimes they yell and throw chairs when neither happens.
- Statistical description: The student gets out of demands that have been presented to them 75% of the time when they yell and throw chairs.
- Mathematical description: P(A|B) = P(A ∩ B)/P(B). (Spoken: The probability of A given that B has occurred is equal to the probability that A and B both occur divided by the probability that B occurs.)
- Using the numbers from the in-text example: 0.75 = (12/20)/(16/20) = 0.6/0.8
To round out this section, Table 11.2 also highlights the benefits we gain by using statistics, probability theory, and the relevant equations. If you were asked whether we have a sense of why the student yells and throws chairs, without statistics and probability theory you would likely be stuck with the nonstatistical descriptions shown in Table 11.2. There isn't anything inherently bad about describing what's going on this way. In fact, when chatting with the student's parents and other professionals during the student's progress meeting, this may be the best way to communicate what's going on. Nevertheless, the nonstatistical description lacks precision and would make it difficult to talk precisely about how these contingencies (i.e., conditional probabilities) may change over time.

[4] Note that the probability of two independent events (joint probability) is different from the probability of two dependent events (conditional probability). Further, we calculated the conditional probability that escape followed our target behavior given that demands had been presented. We could also calculate the probability that demands were placed given that we observed escape: (12/20)/(14/20) = 0.60/0.70 ≈ 0.86, or 86%. This differs from the 75% conditional probability that escape would occur given that we observed demands and highlights the importance of precision and clarity in the focus of our analyses. It also highlights how differently framed probability questions give different answers and suggests there might be conditions where one is more useful than the other.
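If you would rather let a computer do the plugging and chugging, here is a small sketch that reproduces the arithmetic behind Tables 11.1 and 11.2 from the four cell counts. Note how the joint probability computed under an independence assumption (0.56) differs from the joint probability actually observed in the table (0.60), echoing footnote [4]:

```python
# Reproducing the arithmetic behind Tables 11.1 and 11.2. The four cell
# counts come from the hypothetical ABC data in the text (20 observations).
demand_escape = 12  # demand -> yelling/throwing -> escape
demand_other = 4    # demand -> yelling/throwing -> some other consequence
other_escape = 2    # other antecedent -> yelling/throwing -> escape
other_other = 2     # other antecedent -> yelling/throwing -> other consequence
total = demand_escape + demand_other + other_escape + other_other  # 20

p_demand = (demand_escape + demand_other) / total  # 0.80
p_escape = (demand_escape + other_escape) / total  # 0.70

# Joint probability if the two events were independent (Table 11.2, row 1).
p_joint_if_independent = p_demand * p_escape       # 0.56

# Joint probability actually observed in the table.
p_joint_observed = demand_escape / total           # 0.60

# Conditional probability: escape given that a demand preceded the behavior
# (Table 11.2, row 2).
p_escape_given_demand = p_joint_observed / p_demand  # 0.75

print(f"P(demand) = {p_demand:.2f}, P(escape) = {p_escape:.2f}")
print(f"Joint (independence assumed): {p_joint_if_independent:.2f}")
print(f"Joint (observed):             {p_joint_observed:.2f}")
print(f"P(escape | demand):           {p_escape_given_demand:.2f}")
```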
What exactly do these numbers mean?

Probability theory and the related mathematical rules for combining statistics allow us to make statements such as, "There's a 90% chance that Johnny gets attention when he hits people after they stop paying attention to him"; or "There's a 68% chance that when Mohammed emits a high-pitched squeal he'll engage in escape-maintained aggression." But, if we take a step back, we can ask the question of what exactly we mean by a 68% chance or 0.68 probability that something will happen. Well, it turns out that there are at least two different ways that people often interpret what this means. And, going back to the importance of precision in scientific language, each interpretation leads to different ways of precisely defining what we're talking about.
Frequentist approach

The first common way to talk about probabilities in statistics is called the frequentist approach. This approach is quite straightforward and is likely intuitive to many behavior analysts. The frequentist approach simply assumes that all this talk of probabilities is just shorthand for how frequently we would expect something to happen if we had access to all relevant observations. We observe things happen, we collect data on when and where those things occur, and the relative frequencies with which we observe different events occurring are our best guide to life.

An example might help. Let's say that 1000 people are asked to answer the question, "What is your favorite baseball team?" Now, it's an accepted fact that the Pirates are the best team in MLB, and many agree that most people like to cheer for good teams. So, let's say 800 people respond that the Pirates are their favorite team and 200 people respond that some other team is their favorite for some odd, unexplainable reason. If you were asked what percentage of all baseball fans prefer the Pirates, it would be hard to say much unless we assumed that the 1000 people we polled were representative of all baseball fans. Stated differently, the question becomes how well we can infer the frequency of all Pirates fans if we were able to ask all baseball fans.
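As a rough sketch of the frequentist logic, here is how the poll numbers could be turned into a point estimate with a standard error. The 95% interval uses a simple normal approximation (a Wald interval), and it is only as trustworthy as the assumption that the 1000 respondents represent all baseball fans:

```python
import math

# A frequentist sketch of the poll example: 800 of 1000 respondents
# name the Pirates. The counts mirror the text.
n, pirates_fans = 1000, 800

p_hat = pirates_fans / n                       # point estimate: 0.80
se = math.sqrt(p_hat * (1 - p_hat) / n)        # standard error of the proportion
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # approximate 95% confidence interval

print(f"Estimated proportion of Pirates fans: {p_hat:.3f}")
print(f"Approximate 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```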
For those with an aversive and demeaning history with this brand of "inferential statistics," and who may have begun to snicker, hold on a quick second. Consider many applied settings where behavior analysts provide services. In these settings, we conduct observations and collect data on a limited set of all environment-behavior relations for one specific behavior relative to all contingencies that behavior has ever contacted (i.e., all relevant observations). Let's say, based on a limited sample of observations during a functional analysis (FA), we find the target behavior of interest might be maintained by escape (i.e., negative reinforcement). If you were asked, based on your FA data, what percentage of all previous emissions of the target behavior have ever contacted escape, it would be hard to say much unless we were to assume that the small set of data we collected was a good enough representation of all the relevant functionally related behavior-environment relations for this target behavior, historically[5]. Said differently, we feel confident in our assumption that our small set of observations was a good enough representation of all relevant contingencies that have ever operated on that behavior to infer a probable function of the target behavior. So confident, in fact, that we develop a behavior intervention plan (BIP) based on the results of the FA, and we bill tens of thousands of dollars to implement the BIP and train others to implement it. In both the abovementioned examples, we have a specific question we're trying to ask wherein it's impractical, if not impossible, to collect the data we would need to answer the question perfectly. As a result, we do our best with what data we can collect by assuming that if we get enough data and our method for targeting that collection is sound, then we can practically use the data to do something (e.g., create the right amount of merchandise for the MLB example; write a successful BIP for the FA example). Succinctly, the observed frequencies of events—if collected well—allow us to infer how the world works and act upon that information.

[5] See Chapter 7 if you forgot how to determine ways in which these claims might be made.
Interesting examples: What does this have to do with me?

Taking a frequentist approach has implications for data handling in science and in applied behavior analysis. In science broadly, taking a frequentist approach would mean that your goal is to get as much data as you can on your phenomenon of interest. Why? Because it's unlikely that any single observation or small set of observations will tell you the entire story about how the world works. Nature is wonderfully complex, and few events seem to occur without fail. It's because of this probabilistic nature of the universe that replication holds such high esteem within scientific research. Weird stuff happens and random things occur. If you can consistently replicate a finding, then you're likely onto something. From the frequentist perspective, the idea is to get as many observations as possible so we can make more accurate and specific claims about the frequencies with which different events are likely to occur.

The same implication for data handling holds for behavior analysts. If you take a frequentist approach, then increasing the number of observations is the name of the game. Few behavior analysts would likely argue that they can see a behavior occur in context a single time and confidently identify the function of the behavior. But, when five out of five observations suggest the same function, we likely feel a bit more certain about the function and the likely success of a related BIP. And, once 95 out of 100 observations suggest the same function, then we're likely feeling really certain that a BIP designed around that function will be successful. Taking a frequentist approach means your uncertainty about behavior-environment relations will decrease as you collect more data, because more data let you define those frequencies more accurately. And, with enough data, that uncertainty approaches zero.
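A quick sketch of that shrinking uncertainty: holding the observed proportion constant while the number of observations grows, the standard error of the estimate falls toward zero. The observation counts below are illustrative, not from a real dataset:

```python
import math

# With the observed proportion held at 0.80, the standard error of the
# estimate shrinks as the number of observations grows.
for successes, trials in [(4, 5), (16, 20), (80, 100), (800, 1000)]:
    p = successes / trials
    se = math.sqrt(p * (1 - p) / trials)
    print(f"{successes:>3}/{trials:<4} -> estimate {p:.2f}, standard error {se:.3f}")
```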
Bayesian approach

The second common way to talk about probabilities in statistics is called the Bayesian approach. This approach might be less intuitive to behavior analysts, but the logic is rather straightforward. The Bayesian approach assumes that all this talk of probabilities is just shorthand for how well a new observation aligns with our knowledge of the world (aka our learning history). Each new observation of an environment-behavior relation provides new information that either deviates from or aligns with our total set of past experiences. In more technical terms, discussions of probability are simply restatements of what we currently know about an addressable question, and these statements can be updated with each new experience. That is, we observe things happen, we collect data on when and where those many things occur, and we can aggregate that information to make predictions about what happens in life based on our learned history.

An example might help us differentiate the frequentist approach from the Bayesian approach (Chechile, 2020). Consider an unbiased coin[6] that we can flip into the air. Before flipping the coin, if you were asked the probability that the coin will land heads (and not tails), we suspect you'll likely guess 50%. Why? Well, there are two sides to the coin, one of those two sides is heads, so P(heads) = 1/2 = 50%. Now, consider a situation where right before the coin was flipped, a friend told you that they already flipped the coin 10,000 times and 4876 of those flips landed heads and the other 5124 landed tails. Here the frequentist approach starts to lead us to some interesting contradictions. Based on the observed frequencies, a reasonable probability that the coin will land heads would now be 48.76%. But was our sample size too small? If it's an unbiased coin, then on average, and in the long run, we should still guess that the probability the coin will land heads is 50% based on this knowledge. Let's also add one final twist! Suppose that within 200 ms after the coin has been flipped, high-speed cameras feed data into a computer programmed to analyze the observed data relative to the laws of physics. This program is correct 99% of the time and predicts this flip of the coin will land heads. While the coin is flipping in the air, you're asked again with what probability you predict the coin will land heads. A reasonable response would be 99%.

If you have no qualms with the changing probabilities in the previous paragraph, congratulations—the Bayesian approach is intuitive for you. Why is all the above problematic for the frequentist approach? Because the observed frequencies didn't change in any of those scenarios. If you take a frequentist approach, then you should guess the same probability that the coin lands heads in all the scenarios above, regardless of your updating knowledge about the situation. In contrast to observed frequencies, the Bayesian approach assumes that statements of probabilities are just shorthand for the aggregation of information we have about a situation before an event occurs. Some of that information might come from observed frequencies. But some of that information may come by way of generalizing information relevant to the current context (e.g., laws of physics when flipping coins, behavioral principles and processes when predicting future behavior).

An example from behavior analysis might be useful. Baum and Davison conducted a series of experiments wherein they sought to identify how reinforcers control behavior[7]. The gist of the logic framing these studies was that time can't flow backward and, as such, reinforcers can't reach back in time to strengthen a response that occurred immediately before them. So, the question then becomes: how does the addition or removal of a stimulus following behavior increase responding in the future? Through a series of very neat experiments, Baum and Davison found that reinforcers seem to act as discriminative stimuli that guide responding in future similar situations. Nothing problematic here from a frequentist perspective. The twist comes when contingencies are arranged such that when one response contacts reinforcement, the next reinforcer is set up to follow a different response. Frequentist approaches built on only historical counts might lead us to incorrect predictions until the counts with the new contingency arrangements "correct" themselves[8]. However, a Bayesian approach that incorporates this additional knowledge likely could begin making accurate predictions almost immediately.

In both the "coin flip" and "reinforcers as signposts" examples, we're trying to identify the conditions under which some environment-behavior relation does and does not hold. Stated from the Bayesian perspective, given our prior knowledge of environment-behavior relations, we want to identify how future observations might change, alter, or provide further support for the knowledge we bring into the situation. If you find yourself getting excited about this topic, then the next step in comparing assumptions across these approaches would be to wade into the dark (but warm) waters of epistemology, what it means "to know something," and how we aggregate and explain all this verbally. We'll leave that conversation for another book. For now, let's pivot to more practical matters.

[6] By this we mean a coin that is evenly weighted such that there is no reason to suspect it will come up heads more often than tails.
[7] See Davison (2015) for a video lecture summarizing this work as well as how Cowie has more recently extended this fascinating area of research (Cowie et al., 2016, 2017).
[8] There are a few logical backdoors out of the frequentist perspective failing to make accurate predictions until we obtain enough observations. Send us an email with your thoughts and we'll let you know if you caught some of the ones we were thinking of!
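As a toy illustration of the point in footnote [8] (and emphatically not Baum and Davison's actual model), compare an estimate built from all historical counts with a recency-weighted estimate after a contingency reversal. Recency weighting is only a crude stand-in for the Bayesian move of bringing extra knowledge ("contingencies can change") into the model, but it shows why pure counting lags:

```python
# A toy simulation: response A is reinforced for 50 trials, then the
# contingency flips and it is never reinforced again. The trial counts
# and the weighting constant are illustrative.
outcomes = [1] * 50 + [0] * 10

history_count = 0
recency_estimate = 0.5   # start agnostic about P(reinforcer | response A)
alpha = 0.2              # weight placed on each new observation

for t, outcome in enumerate(outcomes, start=1):
    history_count += outcome
    all_history_estimate = history_count / t                    # lags badly
    recency_estimate += alpha * (outcome - recency_estimate)    # adapts fast

print(f"All-history estimate after reversal: {all_history_estimate:.2f}")  # ~0.83
print(f"Recency-weighted estimate:           {recency_estimate:.2f}")      # ~0.11
```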
Interesting examples: What does this have to do with me?

Taking a Bayesian approach has implications for data handling in science and in behavior analysis. In science broadly, taking a Bayesian approach would mean that your goal is to derive conditions and collect data that provide you with the most relevant new information. That is, it's not necessarily about the volume of data collected (though more always helps), but rather how each new observation provides you with new information about the conditions under which something holds, does not hold, or might be changing. When establishing data collection procedures, the focus might be on identifying and testing boundary conditions to better understand how data from many different sources and situations combine into a complete picture that allows for description and prediction. That is, you want to identify ways to test your total knowledge of a system.

The Bayesian approach also takes a bit of a different approach to analyzing data. Bayesian statistics rely on Bayes theorem, which is a mathematical formula that tells you how much your beliefs about event probabilities should change with each piece of new information you contact (for a deep dive, see Joyce, 2021). As a formula, Bayes theorem is:

P(A|B) = [P(A) × P(B|A)] / P(B)    (11.1)
In words, this is spoken as, "The probability that A happens given B happened is equal to the probability that A happened times the probability that B happens given A has happened, divided by the probability that B happens." Logically, this equation involves a few bits of information that highlight how it deviates from the frequentist approach to significance testing. The idea is that we enter any experiment with some kind of belief about what the parameters (e.g., mean, median, standard deviation) should be based on our knowledge of the world. This is referred to as "the prior," P(A), because it is the information we had prior to the experiment starting. We then collect some data from our experiment and calculate P(B|A), which is "the likelihood" that we would observe the data we got assuming the priors are true. The last bit is calculating the "probability of the data," P(B), which is determined by the evidence we collected in the experiment. From there, you can plug and chug and calculate "the posterior," which is the probability of the parameters given your data and which informs you how your priors should change based on the data you collected.

If this sounds like a lot to track and calculate, it's because it is. Technically, Bayesian statistics have been around longer than frequentist tests of statistical significance but were not widely adopted for two likely reasons. First, the statistical calculations and tests rely on you specifying the priors of your parameters before an experiment, which can be very difficult without access to previous datasets (thank you, Internet!)[9]. Second, the calculations and statistical knowledge needed to pull them off are a bit more complex than frequentist approaches, which could be very difficult to manage using paper and pen[10]. However, increased access to datasets from past research as well as improved computational technologies now allow both challenges to be overcome quite readily. The result has been a steady increase in the use of Bayesian statistics by researchers in fields such as psychology (e.g., van de Schoot et al., 2017), education (e.g., König & van de Schoot, 2018), and medicine (e.g., Hackenberger, 2019).

So what does this mean for you? Well, it means absolutely nothing if you don't like the Bayesian approach or it sounds too complicated. Frequentist approaches still seem to be the dominant statistical tests in the empirical literature. You'll be just fine in life knowing that Bayesian approaches exist and why someone might use them. And, both approaches can be logically justified. However, if you want to dive more into this realm of quantitatively specifying and updating your beliefs about the world, then your yellow brick road leads through Google searches around "Bayesian alternatives to [insert statistical test of interest here]."

[9] Though it is not really possible to have no prior information, researchers have been experimenting with methods to create "ignorant" or "uninformed" priors to circumvent instances where the researcher cannot (or does not want to) make strong assumptions about their priors (e.g., Strachan & van Dijk, 2008). However, caution should be used when taking these approaches as the results can be biased and impact the statistical results (e.g., Van Dongen, 2006).
[10] See van Doorn et al. (2021) for guidelines for conducting and reporting Bayesian analyses.
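To see Eq. (11.1) in action, here is a sketch that applies it to the coin-flip story from earlier in the chapter. The prior comes from the friend's 10,000 flips and the likelihood from the camera system's stated 99% accuracy; we additionally assume the camera errs symmetrically (equally often on heads and tails), which the text does not specify:

```python
# Applying Eq. (11.1) to the coin-flip example. The 4876/10,000 prior and
# the 99% accuracy figure come from the text; symmetric camera error is
# our added assumption.
p_heads = 4876 / 10000            # the prior, P(A)
p_says_heads_given_heads = 0.99   # the likelihood, P(B|A)
p_says_heads_given_tails = 0.01   # assumed symmetric error rate

# P(B): total probability the camera reports "heads"
p_says_heads = (p_heads * p_says_heads_given_heads
                + (1 - p_heads) * p_says_heads_given_tails)

# The posterior: P(A|B) = P(A) * P(B|A) / P(B)
posterior = p_heads * p_says_heads_given_heads / p_says_heads
print(f"P(heads | camera reports heads) = {posterior:.3f}")
```

The posterior lands at roughly 0.99, which matches the "reasonable response" given in the coin-flip example above.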
Chapter summary

Statistics are just verbal behavior. As a result, there are many different ways one might go about collecting, analyzing, interpreting, and presenting quantitative data. At the end of the day, all this talk is really about probabilities. What is the most probable amount, duration, intensity, or rate of responding (central tendency)? What is the most probable amount of variability we will see in any one session (variation)? What is the most probable actual difference in behavior between baseline and my intervention (significance and effect sizes)? What is the probable relevance of other IVs on behavior, or the probable relation between levels of IVs and behavior (quantitative modeling)? And, how do these probabilities change when I control for time?

In sum, statistics for behavior analysts (and all sciences, really) is based on probability theory. Throughout the book, we have relied primarily on the frequentist approach to turn our learning history into verbal behavior about probabilities. That is, we start by counting the frequency with which different antecedents, behaviors, and consequences occur. And, on average and in the long run, those frequencies will create a distribution based on their data type that we can use to describe and make predictions about behavior-environment relations. But there are other ways to use verbal behavior based on probabilities to describe and make predictions about behavior-environment relations. One increasingly popular method is via Bayesian statistics, which allows us to include data from past observations as well as other knowledge we have that might impact our predictions prior to an experiment. Though a bit more sophisticated and complex statistically, software programs and open datasets make Bayesian analyses increasingly tractable for those willing to jump in and think about the world through the Bayesian looking glass. As you stare into the abyss through this looking glass (and it stares back at you), just be cautious you don't turn into a Mad Hatter.
Closing thoughts

It's time to wind down this long, strange trip that we've been traveling together. We want to reiterate our sincere thanks to you for picking this book up, and we hope that it was written in a way that made statistics enjoyable (or at least tolerable). We also hope it is now glaringly obvious that you currently use statistics in your everyday work as a behavior analyst. We further hope that some of the material in the book has prompted you to think of additional and different ways that you might use statistics to improve the precision with which you describe, predict, and improve the lives of the people with whom you work.
Mathematics and statistics have proven to be remarkably and practically useful in many professional and scientific disciplines. Just like Spanish, Yoruba, or Klingon, statistics and mathematics are simply systems of verbal behavior. And, by itself, verbal behavior is not scary or dangerous. With this introduction to statistics out of the way, you should be well prepared for whatever statistical verbal challenge you choose to take on next. And we hope the next time you hear the word “statistics” that your pupils dilate and your pulse quickens from pure delight at the wonderful descriptions of our beautiful universe that you are about to encounter.
References

Carroll, L. (1893). Alice's adventures in Wonderland. T. Y. Crowell & Co.
Carroll, L. (1909). Through the looking-glass and what Alice found there. Henry Altemus Company.
Chechile, R. A. (2020). Bayesian statistics for experimental scientists: A general introduction using distribution-free methods. MIT Press.
Cowie, S., Davison, M., & Elliffe, D. (2016). A model for discriminating reinforcers in time and space. Behavioural Processes, 127, 62–73. https://doi.org/10.1016/j.beproc.2016.03.010
Cowie, S., Davison, M., & Elliffe, D. (2017). Control by past and present stimuli depends on the discriminated reinforcer differential. Journal of the Experimental Analysis of Behavior, 108(2), 184–203. https://doi.org/10.1002/jeab.268
Davison, M. (2015). What reinforcers do to behavior. Society for the Quantitative Analysis of Behavior. Retrieved from https://youtu.be/qYFKWTAJjEo
Hackenberger, B. K. (2019). Bayes or not Bayes, is this the question? Croatian Medical Journal, 60(1), 50–52. https://doi.org/10.3325/cmj.2019.60.50
Hall, S. S. (2005). Comparing descriptive, experimental, and informant-based assessment of problem behaviors. Research in Developmental Disabilities, 26(6), 514–526. https://doi.org/10.1016/j.ridd.2004.11.004
Joyce, J. (2021). Bayes' theorem. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. Retrieved from https://plato.stanford.edu/archives/fall2021/entries/bayes-theorem/
König, C., & van de Schoot, R. (2018). Bayesian statistics in educational research: A look at the current state of affairs. Educational Review, 70(4), 486–509. https://doi.org/10.1080/00131911.2017.1350636
Lerman, D. C., & Iwata, B. A. (1993). Descriptive and experimental analyses of variables maintaining self-injurious behavior. Journal of Applied Behavior Analysis, 26(3), 293–319. https://doi.org/10.1901/jaba.1993.26-293
Marr, M. J. (2015). Reprint of "Mathematics as verbal behavior." Behavioural Processes, 114, 34–40. https://doi.org/10.1016/j.beproc.2015.03.008
Merriam-Webster (2021). Statistics. Retrieved from https://www.merriam-webster.com/dictionary/statistics
Oliver, A. C., Pratt, L. A., & Normand, M. P. (2015). A survey of functional behavior assessment methods used by behavior analysts in practice. Journal of Applied Behavior Analysis, 48(4), 817–829. https://doi.org/10.1002/jaba.256
Pence, S. R., Roscoe, E. M., Bourret, J. C., & Ahearn, W. H. (2009). Relative contributions of three descriptive methods: Implications for behavioral assessment. Journal of Applied Behavior Analysis, 42(2), 425–446. https://doi.org/10.1901/jaba.2009.42-425
Rudas, T. (2010). Probability theory. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (pp. 33–36). Elsevier. ISBN: 978-0-08-044894-7.
Strachan, R. W., & van Dijk, H. K. (2008). Bayesian model selection with an uninformative prior. Oxford Bulletin of Economics and Statistics, 65(s1), 863–876. https://doi.org/10.1046/j.0305-9049.2003.00095.x
Thompson, R. H., & Iwata, B. A. (2007). A comparison of outcomes from descriptive and functional analyses of problem behavior. Journal of Applied Behavior Analysis, 40(2), 333–338. https://doi.org/10.1901/jaba.2007.56-06
van de Schoot, R., Winter, S. D., Ryan, O., Zondervan-Zwijnenburg, M., & Depaoli, S. (2017). A systematic review of Bayesian articles in psychology: The last 25 years. Psychological Methods, 22(2), 217–239. https://doi.org/10.1037/met0000100
Van Dongen, S. (2006). Prior specification in Bayesian statistics: Three cautionary tales. Journal of Theoretical Biology, 242(1), 90–100. https://doi.org/10.1016/j.jtbi.2006.02.002
van Doorn, J., van den Bergh, D., Böhm, U., Dablander, F., Derks, K., Draws, T., . . . Wagenmakers, E. J. (2021). The JASP guidelines for conducting and reporting a Bayesian analysis. Psychonomic Bulletin & Review, 28, 813–826. https://doi.org/10.3758/s13423-020-01798-5
Vigen, T. (2015). Spurious correlations. Hachette Books. ISBN-13: 978-0316339438.